Getting trends from raw data - data-mining

Suppose we have a lot of data that looks like this:
chain of digits | time
23 67 34 23 54 | 12:34
23 54 | 12:42
78 96 23 | 12:46
56 93 23 54 | 12:48
I need to find trends in the digit chains (growing, falling, stable). In my example they might be 23 54 or 23.
I also want to find various correlations between trends. The data is very big; it might be billions of rows. Can you suggest any books, articles, or algorithms? Note that I only need information about trends and correlations in this type of data; I don't need basic data-mining books.

Here's the grain of an algorithm. It certainly isn't fleshed out or tested, and it may not be complete. I'm just throwing it out here as a possible starting point.
It seems the most challenging issue is time required to run the algorithm over billions of rows, followed perhaps by memory limitations.
I also believe the fundamental task involved in solving this problem lies in the single operation of "comparing one set of numbers with another" to locate a shared set.
Therefore, might I suggest the following (rough) approach, in order to tackle both time and memory (a small code sketch follows the steps):
(1) Consolidate multiple sets into a single, larger set.
i.e., take 100 consecutive sets (in your example, 23, 67, 34, 23, 54, 23, 54, 78, 96, 23, and the following 97 sets), and simply merge them together into a single set (ignoring duplicates).
(2) Give each *consolidated* set from (1) a label (or index),
and then map this set (by its label) to the original sets that compose it.
In this way, you will be able to retrieve (look up) the original individual sets 23, 67, 34, 23, 54, etc.
(3) The data is now denormalized - there are a much smaller number of sets, and each set is much larger.
Now, the algorithm moves onto a new stage.
(4) Develop an algorithm to look for matching sequences between any two of these larger sets.
There will be many false positives; however, hopefully the nature of your data is that the false positives will not "ruin" the efficiency that is gained by this approach.
I don't provide an algorithm to perform the matching between 2 individual sets here; I assume that you can come up with one yourself (sort both the sets, etc.).
(5) For every possible matching sequence found in (4), iterate through the individual sets that compose
the two larger sets being compared, weeding out false positives.
I suspect that the above step could be optimized significantly, but this is the basic idea.
At this point, you will have all of the matching sequences between all original sets that compose the two larger sets being compared.
(6) Execute steps (4) and (5) for every pair of large sets constructed in (2).
Now, you will have ALL matching sequences - with duplicates.
(7) Remove duplicates from the set of matching sequences.
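Here is a rough Python sketch of the steps above, under big simplifying assumptions: the rows are already parsed into lists of integers, a "matching sequence" is taken to be a run of shared values in the order they appear in one row, and the group size and minimum run length are made-up parameters you would tune (and the matching routine is only a placeholder):

def consolidate(rows, group_size=100):
    # Steps (1)-(2): merge consecutive rows into one large set and keep the
    # original rows alongside it (the position in the list is the label).
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        merged = set().union(*chunk)          # duplicates ignored
        groups.append((merged, chunk))
    return groups

def shared_runs(a, b, min_len=2):
    # Placeholder for steps (4)-(5): runs of values common to both rows,
    # in the order they appear in row `a`.
    common = set(a) & set(b)
    runs, current = [], []
    for value in a:
        if value in common:
            current.append(value)
        else:
            if len(current) >= min_len:
                runs.append(tuple(current))
            current = []
    if len(current) >= min_len:
        runs.append(tuple(current))
    return runs

def matching_sequences(rows, group_size=100):
    # Steps (3)-(7): cheap comparison of the large sets first, then drill
    # into the individual rows to weed out false positives.
    groups = consolidate(rows, group_size)
    found = set()                             # a set removes duplicates (step 7)
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            big_i, rows_i = groups[i]
            big_j, rows_j = groups[j]
            if not (big_i & big_j):           # no shared values at all: skip
                continue
            for ra in rows_i:
                for rb in rows_j:
                    found.update(shared_runs(ra, rb))
    return found

rows = [[23, 67, 34, 23, 54], [23, 54], [78, 96, 23], [56, 93, 23, 54]]
print(matching_sequences(rows, group_size=2))   # e.g. {(23, 54)}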
Just a thought.

QR code generation: evaluating the dark module percentage during data masking

I'm implementing a QR code generation algorithm as explained on thonky.com and I'm trying to understand one of the cases:
As stated on this page, after getting the percentage of dark modules out of the whole code, I should take the two nearest multiples of five (for example, 45 and 50 for 48%). But what if the percentage is itself a multiple of 5, for example 45.0? Which numbers should be taken? 45? 40 and 50? 45 and 40? 45 and 50? Something totally different? I couldn't find an answer to that anywhere...
Thank you very much in advance for the help!
Indeed the Thonky tutorial is unclear in this respect, so let's turn to the official standard (behind a paywall at ISO but easy to find online). Section 8.8.2, page 52, Table 24:
Evaluation condition: 50 ± (5 × k)% to 50 ± (5 × (k + 1))%
Points: N₄ × k
Here, N₄ = 10, and
k is the rating of the deviation of the proportion of dark modules in the symbol from 50% in steps of 5%.
So for exactly 45% dark modules, you'd have k = 1, resulting in a penalty of 10 points.
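As a small sketch of that rule in Python (the function name is just for illustration; percent is the proportion of dark modules expressed as a percentage):

def penalty_score_4(percent, n4=10):
    # k = deviation of the dark-module proportion from 50%, in whole steps of 5%
    k = int(abs(percent - 50) // 5)
    return n4 * k

print(penalty_score_4(48.0))   # deviation 2% -> k = 0 -> 0 points
print(penalty_score_4(45.0))   # deviation 5% -> k = 1 -> 10 points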
Also note that it doesn't really matter if you get this slightly wrong. Because the mask pattern identifier is encoded in the format string, a reader can still decode the QR code even if you accidentally chose a slightly suboptimal mask pattern.

Loading first few observations of data set without reading entire data set (Stata 13.1)?

(Stata/MP 13.1)
I am working with a set of massive data sets that take an extremely long time to load. I am currently looping through all the data sets to load each of them.
Is it possible to just tell Stata to load in the first 5 observations of each dataset (or, in general, the first n observations in each use command) without actually having to load the entire data set? Otherwise, if I were to load in the entire data set and then just keep the first 5 observations, the process takes an extremely long time.
Here are two work-arounds I have already tried
use in 1/5 using mydata : I think this is more efficient than just loading the data and then keeping the observations you want in a different line, but I think it still reads in the entire data set.
First load all the data sets, then save copies of all the data sets to just be the first 5 observations, and then just use the copies: This is cumbersome as I have a lot of different files; I would very much prefer just a direct way to read in the first 5 observations without having to resort to this method and without having to read the entire data set.
I'd say using in is the natural way to do this in Stata, but testing shows
you are correct: it really makes no "big" difference, given the size of the data set. Here is an example with 148,000,000 observations:
sysuse auto, clear
expand 2000000
tempfile bigfile
save "`bigfile'", replace
clear
timer on 1
use "`bigfile'"
timer off 1
clear
timer on 2
use "`bigfile'" in 1/5
timer off 2
timer list
timer clear
Resulting in
. timer list
1: 6.44 / 1 = 6.4400
2: 4.85 / 1 = 4.8480
I find that surprising since in seems really efficient in other contexts.
I would contact Stata tech support (and/or search around, including www.statalist.com) just to ask why in isn't much faster here
(independently of whether you find some other strategy to handle this problem).
It's worth using, of course, but it's not fast enough for many applications.
In terms of workflow, your second option might be the best: leave the computer running while the smaller datasets are created (use a loop), and get back to your regular coding/debugging once that's finished. This really depends on what you're doing, so it may or may not work.
Actually, I found the solution. If you run
use mybigdata if runiform() <= 0.0001
Stata will take a random sample of 0.0001 of the data set without reading the entire data set.
Thanks!
Vincent
Edit: 4/28/2015 (1:58 PM EST)
My apologies. It turns out the above was actually not a solution to the original question. It seems that on my system there was high variability in the speed of using
use mybigdata if runiform() <= 0.0001
each time I ran it. When I posted that the above was a solution, I think that run just happened to be a faster instance. However, as I am now repeatedly running
use mybigdata if runiform() <= 0.0001
vs.
use in 1/5 using mydata
I am actually finding that
use in 1/5 using mydata
is on average faster.
In general, my question is simply how to read in a portion of a Stata data set without having to read in the entire data set, for computational purposes, especially when the data set is really large.
Edit: 4/28/2015 (2:50 PM EST)
In total, I have 20 datasets, each with between 5 and 15 million observations. I only need to keep 8 of the variables (there are 58-65 variables in each data set). Below is the output from the first four "describe, short" statements.
2004 action1
Contains data from 2004action1.dta
obs: 15,039,576
vars: 64 30 Oct 2014 17:09
size: 2,827,440,288
Sorted by:
2004 action2578
Contains data from 2004action2578.dta
obs: 13,449,087
vars: 59 30 Oct 2014 17:16
size: 2,098,057,572
Sorted by:
2005 action1
Contains data from 2005action1.dta
obs: 15,638,296
vars: 65 30 Oct 2014 16:47
size: 3,143,297,496
Sorted by:
2005 action2578
Contains data from 2005action2578.dta
obs: 14,951,428
vars: 59 30 Oct 2014 17:03
size: 2,362,325,624
Sorted by:
Thanks!
Vincent

Fast way to convert strings to numbers on large dataset

I have a data set with tens of millions of rows. Several columns in this data set represent categorical features. Each level of these features is represented by an alphanumeric string like "b009d929".
C1 C2 C3 C4 C5 C6 C7
68fd1e64 80e26c9b fb936136 7b4723c4 25c83c98 7e0ccccf de7995b8 ...
68fd1e64 f0cf0024 6f67f7e5 41274cd7 25c83c98 fe6b92e5 922afcc0
I'd like to be able to use Python to map each distinct level to a number to save memory, so that feature C1's levels would be replaced by numbers from 1 to C1_n, C2's levels would be replaced by numbers from 1 to C2_n, and so on.
Each feature has a different number of levels, ranging from under 10 to 10k+.
I tried dictionaries with Pandas' .replace(), but it gets extremely slow.
What is a fast way to approach this problem?
I figured out that the categorical feature values were hashed onto 32 bits, so I ended up reading the file in chunks and applying this simple function:
int(categorical_feature_value, 16)
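With pandas, that might look roughly like the following (the file name, column list, chunk size, and the -1 placeholder for missing values are made up for illustration):

import pandas as pd

hex_columns = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
pieces = []
for chunk in pd.read_csv("data.csv", chunksize=1000000):
    for col in hex_columns:
        # each level is a 32-bit hash written as hex, so parsing it as a
        # base-16 integer gives a compact numeric code directly
        chunk[col] = chunk[col].map(lambda v: int(v, 16) if isinstance(v, str) else -1)
    pieces.append(chunk)
df = pd.concat(pieces, ignore_index=True)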

Optimization run assistance

I am running an optimisation of two sets of data against each other and am after some assistance with looking up the settings of a run based on the calculated results. I'll explain...
I run 2 data lines against each other (think graph lines) - Line A and Line B. These lines have crossing points, upward and downward, based on the direction of each line: e.g. Line A going up while Line B goes down is an 'upward cross', and Line A going down while Line B goes up is a 'downward cross'. The program performs financial analysis.
I analyze the crossing points and gain a resultant 'Rank' from the analysis based on a set of rules. The rank is a single integer.
Line A has a number of settings for the optimisation run e.g. Window 1 from a value of 10 to 20 and window 2 at a value of 30 to 40. Line B also has settings.
When I run the optimisation I iterate through the parameters available for each line and calculate the rank. The result of the optimisation run is a list of the ranks, the size of which is the number of permutations available.
So my question is this:
What is the best way to look up the line settings from the calculated rank, using a position (index) in the rank list? The optimisation settings used to create the run will be stored for that rank run and can be used for the look-up.
I will also be adding additional parameters for the lines to the system in the future, so I want the program to take future line settings into account without affecting any rank files created before the new parameter was added.
In addition to that, I want to be able to find an index based on a particular setting included in the optimisation run (the reverse look-up of the previous method).
I want to avoid versioning for backward compatibility if at all possible, so that the lookup algorithm will be self-sufficient.
Is a hash table suitable for this purpose or do you have any implementation techniques that would fit better? Do you have any examples of this type of operation in action in C++?
Thanks,
Chris.
If I understand correctly, you have a bunch of associated data (settings + rank), on which you would like to be able to perform lookups with different key types. If so, then Boost.MultiIndex sounds like what you're looking for.
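For what it's worth, here is a language-agnostic sketch of that idea in Python (not Boost.MultiIndex itself, just the concept of one record store with several indexes over it; all names and settings are made up):

records = []      # position in this list = position in the rank list
by_rank = {}      # rank -> list of record positions
by_setting = {}   # (setting name, value) -> list of record positions

def add_record(rank, settings):
    pos = len(records)
    records.append({"rank": rank, "settings": dict(settings)})
    by_rank.setdefault(rank, []).append(pos)
    for key, value in settings.items():
        # future parameters are just extra keys, so older records stay usable
        by_setting.setdefault((key, value), []).append(pos)
    return pos

add_record(7, {"window1": 12, "window2": 35})
add_record(3, {"window1": 15, "window2": 31, "window3": 8})   # later run, new parameter
print(records[by_rank[7][0]]["settings"])   # forward look-up: settings for a given rank
print(by_setting[("window1", 15)])          # reverse look-up: runs that used window1 = 15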

Calculating the mean for a set of numbers while neglecting outliers

First of all, this is more of a math question than a coding one, so please be patient.
I am trying to figure out an algorithm to calculate the mean of a set of numbers. However, I need to neglect any numbers that are not close to the majority of the results. Here is an example of what I am trying to do:
Let's say I have a set of numbers similar to the following:
{ 90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400 }
It is clear for the set above that the majority of the numbers lie between 90 and 99; however, I have some outliers like { 300, 400, 2, 3 }. I need to calculate the mean of those numbers while neglecting the outliers. I remember reading about something like this in a statistics class, but I can't remember what it was or how to approach a solution.
Will appreciate any help..
Thanks
What you could do is:
estimate the percentage of outliers in your data: about 25% (4/15) of the provided dataset,
compute the adequate quantiles: 8-quantiles for your dataset, so as to exclude the outliers,
estimate the mean between the first and the last quantile.
PS: Outliers constituting 25% of your dataset is a lot!
PPS: For the second step, we assumed the outliers are "symmetrically distributed". A standard box plot illustrates the same idea, marking as outliers any points more than 1.5 times the interquartile range (IQR) beyond Q1 and Q3 (the 4-quantiles).
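A rough NumPy sketch of the three steps above on the example data (the 12.5% trim on each side corresponds to the 8-quantiles mentioned and would need adjusting to your real outlier rate):

import numpy as np

data = np.array([90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400])

trim = 0.125                                   # one octile off each end (8-quantiles)
lo, hi = np.quantile(data, [trim, 1 - trim])
kept = data[(data >= lo) & (data <= hi)]
print(kept.mean())                             # about 93; the outliers are gone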
First you need to determine the standard deviation and mean of the full set. The outliers are those values that are greater than 3 standard deviations from the (full set) mean.
A simple method that works well is to take the median instead of the average. The median is far more robust to outliers.
You could also minimize a Geman-McClure function:
x̂ = argmin_x′ Σᵢ G(xᵢ − x′), where G(x) = x² / (x² + σ²)
If you plot the G function, you will find that it saturates, which is a good way of softly excluding outliers.
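A quick numerical sketch with NumPy (σ is a tuning constant, roughly the spread of the "good" data; 10 is an arbitrary choice here, and the grid search is only to keep the sketch simple, since the cost is non-convex):

import numpy as np

data = np.array([90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400])
sigma = 10.0

def cost(x):
    r = data - x
    return np.sum(r**2 / (r**2 + sigma**2))   # Geman-McClure rho, saturates for large |r|

# a real implementation would use a proper optimiser (e.g. iteratively reweighted least squares)
grid = np.linspace(data.min(), data.max(), 4000)
x_hat = grid[np.argmin([cost(x) for x in grid])]
print(x_hat)                                  # lands near 92; 2, 3, 300 and 400 barely pull it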
I'd be very careful about this. You could be doing yourself and your conclusions a great disservice.
How is your program supposed to recognize outliers? The normal distribution would say that about 99.7% of the values fall within +/- three standard deviations of the mean, so you could calculate both for the unfiltered data, exclude the values that fall outside that range, and recalculate.
However, you might be throwing away something significant by doing so. The normal distribution isn't sacred; outliers are far more common in real life than the normal distribution would suggest. Read Taleb's "Black Swan" to see what I mean.
Be sure you understand fully what you're excluding before you do so. I think it'd be far better to leave all the data points, warts and all, and come up with a good written explanation for them.
Another approach would be to use an alternative measure like the median, which is less sensitive to outliers than the mean. It's harder to calculate, though.