Correct values for SsaSpikeEstimator's pvalueHistoryLength - ml.net

In the creation of a SsaSpikeEstimator instance by the DetectSpikeBySsa method, there is a parameter called pvalueHistoryLength - could anybody please help me understand, for any given time series with X points, which is the optimal value for this parameter?

I got similar issue, when I try to read the paper, https://arxiv.org/pdf/1206.6910.pdf, I notice one paragraph
Also, simulations and theory (Golyandina, 2010) show that it is
better to choose window length L smaller than half of the time series length
N. One of the recommended values is N/3.
Maybe that's why in the ML.Net Power Anomaly example, the value is chosen to be 30 for the 90 points dataset.

Related

How to automatically feed a cell value from a range of values, based on its matching condition with other cell value

I'm making a time-spending tracker based on the work I do every hour of the day.
Now, suppose I have 28 types of work listed in my tracker (which I also have to increase from time to time), and I have about 8 significance values that I have decided to relate to these 28 types of work, predefined.
I want that, as soon as I enter a type of work in cell 1 - I want the adjacent cell 2 to get automatically populated with a significance value (from a range of 8 values) that is pre-definitely set by me.
Every time I input a new or old occurrence of a type of work, the adjacent cell should automatically get matched with its relevant significance value & automatically get populated in real-time.
I know how to do it using IF, IFS, and IF_OR conditions, but I feel that based on the ever-expanding types of work & significance values, the above formulas will be very big, complicated, and repetitive in the future. I feel there's a more efficient way to achieve it. Also, I don't want it to be selected from a drop-down list.
Guys, please help me out with the most efficient way to handle this. TUIA :)
Also, I've added a snapshot and a sample sheet describing the problem.
Sample sheet
XLOOKUP() may work. Try-
=XLOOKUP(D2,A2:A,B2:B)
Or FILTER() function like-
=FILTER(B2:B,A2:A=D2)
You can use this formula for a whole column:
=INDEX(IFERROR(VLOOKUP(C14:C,A2:B9,2,0)))
Adapt the ranges to your actual tables in order to include in the second argument all the potential values and their significances
This is the formula, that worked for me (for anybody's reference):
I created another reference sheet, stating the types of work & their significance. From that sheet, I'm using either vlookup, filter, xlookup.Using gforms for inputting my data.
=ARRAYFORMULA(IFS(ROW(D:D)=1,"Significance",A:A="","",TRUE,VLOOKUP(D:D,Reference!$A:$B,2,0)))

How to normalize sequence of numbers?

I am working user behavior project. Based on user interaction I have got some data. There is nice sequence which smoothly increases and decreases over the time. But there are little discrepancies, which are very bad. Please refer to graph below:
You can also find data here:
2.0789 2.09604 2.11472 2.13414 2.15609 2.17776 2.2021 2.22722 2.25019 2.27304 2.29724 2.31991 2.34285 2.36569 2.38682 2.40634 2.42068 2.43947 2.45099 2.46564 2.48385 2.49747 2.49031 2.51458 2.5149 2.52632 2.54689 2.56077 2.57821 2.57877 2.59104 2.57625 2.55987 2.5694 2.56244 2.56599 2.54696 2.52479 2.50345 2.48306 2.50934 2.4512 2.43586 2.40664 2.38721 2.3816 2.36415 2.33408 2.31225 2.28801 2.26583 2.24054 2.2135 2.19678 2.16366 2.13945 2.11102 2.08389 2.05533 2.02899 2.00373 1.9752 1.94862 1.91982 1.89125 1.86307 1.83539 1.80641 1.77946 1.75333 1.72765 1.70417 1.68106 1.65971 1.64032 1.62386 1.6034 1.5829 1.56022 1.54167 1.53141 1.52329 1.51128 1.52125 1.51127 1.50753 1.51494 1.51777 1.55563 1.56948 1.57866 1.60095 1.61939 1.64399 1.67643 1.70784 1.74259 1.7815 1.81939 1.84942 1.87731
1.89895 1.91676 1.92987
I would want to smooth out this sequence. The technique should be able to eliminate numbers with characteristic of X and Y, i.e. error in mono-increasing or mono-decreasing.
If not eliminate, technique should be able to shift them so that series is not affected by errors.
What I have tried and failed:
I tried to test difference between values. In some special cases it works, but for sequence as presented in this the distance between numbers is not such that I can cut out errors
I tried applying a counter, which is some X, then only change is accepted otherwise point is mapped to previous point only. Here I have great trouble deciding on value of X, because this is based on user-interaction, I am not really controller of it. If user interaction is such that its plot would be a zigzag pattern, I am ending up with 'no user movement data detected at all' situation.
Please share the techniques that you are aware of.
PS: Data made available in this example is a particular case. There is no typical pattern in which numbers are going to occure, but we expect some range to be continuous with all the examples. Solution I am seeking is generic.
I do not know how much effort you want to involve in this problem but if you want theoretical guaranties,
topological persistence seems well adapted to your problem imho.
Basically with that method, you can filtrate local maximum/minimum by fixing a scale
and there are theoritical proofs that says that if you sampling is
close from your function, then you extracts correct number of maximums with persistence.
You can see these slides (mainly pages 7-9 to get the idea) to get an idea of the method.
Basically, if you take your points as a landscape and imagine a watershed starting from maximum height and decreasing, you have some picks.
Every pick has a time where it is born which is the time where it becomes emerged and a time where it dies which is when it merges with an higher pick. Now a persistence diagram pictures a point for every pick where its x/y coordinates are its time of birth/death (by assumption the first pick does not die and is not shown).
If a pick is a global maximal, then it will be further from the diagonal in the persistence diagram than a local maximum pick. To remove local maximums you have to remove picks close to the diagonal. There are fours local maximums in your example as you can see with the persistence diagram of your data (thanks for providing the data btw) and two global ones (the first pick is not pictured in a persistence diagram):
If you noise your data like that :
You will still get a very decent persistence diagram that will allow you to filter local maximum as you want :
Please ask if you want more details or references.
Since you can not decide on a cut off frequency, and not even on the filter you want to use, I would implement several, and let the user set the parameters.
The first thing that I thought of is running average, and you can see that there are so many things to set, to get different outputs.

C++, determine the part that have the highest zero crosses

I’m not specialist in signal processing. I’m doing simple processing on 1D signal using c++. I want really to know how I can determine the part that have the highest zero cross rate (highest frequency!). Is there a simple way or method to tell the beginning and the end of this part.
This image illustrate the form of my signal, and this image is what I need to do (two indexes of beginning and end)
Edited:
Actually I have no prior idea about the width of the beginning and the end, it's so variable.
I could calculate the number of zero crossing, but I have no idea how to define it's range
double calculateZC(vector<double> signals){
int ZC_counter=0;
int size=signals.size();
for (int i=0; i<size-1; i++){
if((signals[i]>=0 && signals[i+1]<0) || (signals[i]<0 && signals[i+1]>=0)){
ZC_counter++;
}
}
return ZC_counter;
}
Here is a fairly simple strategy which might give you some point to start. The outline of the algorithm is as follows
Input: Vector of your data points {y0,y1,...}
Parameters:
Window size sigma.
A threshold 0<p<1 defining when to start looking for a region.
Output: The start- and endpoint {t0,t1} of the region with the most zero-crossings
I won't give any C++ code, but the method should be easy to implement. As example let us use the following function
What we desire is the region between about 480 and 600 where the zero density higher than in the front. First step in the algorithm is to calculate the positions of zeros. You can do this by what you already have but instead of counting, you store the values for i where you met a zero.
This will give you a list of zero positions
From this list (you can do this directly in the above for-loop!) you create a list having the same size as your input data which looks like {0,0,0,...,1,0,..,1,0,..}. Every zero-crossing position in your input data is marked with a 1.
The next step is to smooth this list with a smoothing filter of size sigma. Here, you can use what you like; in the simplest case a moving average or a Gaussian filter. The higher you choose sigma the bigger becomes your look around window which measures how many zero-crossings are around a certain point. Let me give the output of this filter together with the original zero positions. Note that I used a Gaussian filter of size 10 here
In a next step, you go through the filtered data find the maximum value. In this case it is about 0.15. Now you choose your second parameter which is some percentage of this maximum. Lets say p=0.6.
The final step is to go through the filtered data and when the value is greater than p you start to remember a new region. As soon as the value drops below p, you end this region and remember start and endpoint. Once you are finished walking through the data, you are left with a list of regions, each defined by a start and an endpoint. Now you choose the region with the biggest extend and you are done.
(Optionally, you could add the filter size to each end of the final region)
For the above example, I get 11 regions as follows
{{164,173},{196,205},{220,230},{241,252},{259,271},{278,290},
{297,309},{318,327},{341,350},{458,468},{476,590}}
where the one with the biggest extend is the last one {476,590}. The final result looks (with 1/2 filter region padding)
Conclusion
Please don't be discouraged by the length of my answer. I tried to explain everything in detail. The implementation is really just some loops:
one loop to create the zero-crossings list {0,0,..,1,0,...}
one nested loop for the moving average filter (or you use some library Gaussian filter). Here you can at the same time extract the maximum value
one loop to extract all regions
one loop to extract the largest region if you haven't already extracted it in the above step

CLI/C++ How to store more than 15 digit float number?

For a school project, I have a simple program, which compares 20x20 photos. I put 20 photos, and then i put 21th photo, which is compared to existing 20, and pops up the answer, which photo i did insert (or which one is most similar). The problem is, my teacher wanted me to use nearest neighbour algorithm, so i am counting distance from every photo. I got everything working, but the thing is, if photos are too similar, i got the problem with saying which one is closer to my one. For example i get these distances with 2 different photos (well, they are ALMOST the same):
0 distance: 1353.07982026191
1 distance: 1353.07982026191
It is 15 digits already, and i am using double type. I was reading that long double is the same. Is there any "easy" way to store numbers with more than 15 digits and do math on them?
I count distance using Euclidean distance
I just need to be more precise, or thats limit i probably wont pass here, and i should talk to my teacher i cant compare such similar photos?
I think you need this: gmplib.org
There's a guide how to install this library on this site too.
And here's article about floats: http://gmplib.org/manual/C_002b_002b-Interface-Floats.html#C_002b_002b-Interface-Floats
Maybe you could use an algebraic approach.
Let us assume that you are trying to calcuate if vector x is closer to a or b. What you need to calculate is the sign of
d2(x, a) - d2(x, b)
Which becomes (I'll omit some passages for brevity)
and then
Which only contains differences between values which should be very similar. Summing over such small values should yield a better precision than working on the aggregate.

Autocorrelation returns random results with mic input (using a high pass filter)

Sorry to ask a similar question to the one i asked before (FFT Problem (Returns random results)), but i've looked up pitch detection and autocorrelation and have found some code for pitch detection using autocorrelation.
Im trying to do pitch detection of a users singing. Problem is, it keeps returning random results. I've got some code from http://code.google.com/p/yaalp/ which i've converted to C++ and modified (below). My sample rate is 2048, and data size is 1024. I'm detecting pitch of both a sine wave and mic input. The frequency of the sine wave is 726.0, and its detecting it to be 722.950820 (which im ok with), but its detecting the pitch of the mic as a random number from around 100 to around 1050.
I'm now using a High pass filter to remove the DC offset, but it's not working. Am i doing it right, and if so, what else can i do to fix it? Any help would be greatly appreciated!
(Fixed)
Thanks,
Niall.
Edit: Changed the code to implement a high pass filter with a cutoff of 30hz (from What Are High-Pass and Low-Pass Filters?, can anyone tell me how to convert the low-pass filter using convolution to a high-pass one?) but it's still returning random results. Plugging it into a VST host and using VST plugins to compare spectrums isn't an option to me unfortunately.
Edit: Fixed, thanks for everyones help, but I never got it to work, now using new code.
I am no sound expert, but if you are sampling with 44100 (I guess samples per second) and use 1024 datapoints. You are working with about 1/40th of a second worth of data. I doesn't surprise me that the current pitch varies a lot, depending on which piece you pick. If you want to find the average or main pitch of a voice, I'd expect to need about 1second worth of data.
At 44.1 kHz sampling frequency, 1024 samples is only a little bit over 23 ms worth of data. Isn't it possible that this is simply insufficient data in order to compute the pitch of a human singer?
I mean, the sound I can make that lasts for 23 ms is probably not something I have a lot of pitch-control over; I would expect this kind of measurement to be done over slighly longer periods of time.
The problem is in your findBestCandidates() function:
Inside this function you access the 'inputs' array from 0 up to 'length - 1'.
When you call this function inside detectPitchCalculation() function 'inputs' is 'results' and 'length' is 'nHiPeriodInSamples'.
But 'results' is only allocated and filled up to 'nHiPeriodInSamples - nLowPeriodInSamples - 1'.
So if 'nLowPeriodInSamples' is greater 0 you access unallocated and random memory inside the findBestCandidates() function!
EDIT:
Another bug is that you fill each 'nResolution' entry of the 'results' array in detectPitchCalculation() function but access each entry in the findBestCandidates() function (via the 'inputs' argument). But since you call detectPitchCalculation() with a 'nResolution=1' this does not explain your specific problem...so I will look a little bit more. But it would definitely a problem if you call it with higher resolutions.
I don't see the problem in you code, but I'm no good in C. But I'd try the following to find the problem:
run with data where the result in known, e.g. with sin(x) as input
run it with small data size (e.g. 2)
Compare the results with known correct ones. You should be able to find those on the internet, or do them by hand.
If random means: same input, different output, you most probably have some bug in the initialisation of variables. Use a debugger and known input to check, that all variables, especially all elements of arrays are properly initialized.