How to use uncertainties to weight residuals in a Savitzky-Golay filter. - python-2.7

Is there a way to incorporate the uncertainties on my data set into the result of the Savitzky-Golay fit? Since I am not passing this information into the function, I assume that it is simply calculating the 'best fit' via an unweighted least-squares process. I am currently working with data that has non-uniform uncertainty, so the fit could be improved by including the errors that I have for my main dataset.
The Wikipedia page for the Savitzky-Golay filter suggests how I might go about altering the way the coefficients of the fit are calculated, and I have been staring at the code for scipy.signal.savgol_filter, but I cannot get my head around what I need to adjust so that it will do what I want.
Are there any ready-made weighted SG filters floating about? I find it hard to believe that no one else has ever needed this tool in Python, but maybe I have missed something.

Check out this Python module: https://github.com/surhudm/savitzky_golay_with_errors
This Python script improves upon the traditional Savitzky-Golay filter by accounting for errors or covariance in the data. The inputs and arguments are all modelled after scipy.signal.savgol_filter.
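If you would rather roll your own, a weighted Savitzky-Golay filter is just a sliding weighted least-squares polynomial fit, so numpy.polyfit (which accepts per-point weights through its w argument; for Gaussian uncertainties w should be proportional to 1/sigma) is enough for a minimal sketch. The function name and the simple edge handling below are my own choices, not part of scipy:

```python
import numpy as np

def weighted_savgol(y, sigma, window_length, polyorder):
    """Sliding weighted polynomial smoothing: at each point, fit a polynomial
    of degree `polyorder` to the surrounding `window_length` samples, weighting
    each sample by 1/sigma, and evaluate the fit at the centre of the window."""
    y = np.asarray(y, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    half = window_length // 2          # window_length should be odd and > polyorder
    out = np.empty_like(y)
    for i in range(len(y)):
        lo, hi = max(0, i - half), min(len(y), i + half + 1)
        xs = np.arange(lo, hi) - i     # window positions relative to the centre point
        coeffs = np.polyfit(xs, y[lo:hi], polyorder, w=1.0 / sigma[lo:hi])
        out[i] = np.polyval(coeffs, 0.0)
    return out

# e.g. smoothed = weighted_savgol(y, yerr, window_length=11, polyorder=3)
```

This will be slower than scipy's convolution-based implementation, but it makes the weighting explicit and easy to modify.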
MATLAB's sgolayfilt function supports weights; check the documentation.

Related

Output to Table instead of Graph in Stata

In Stata, is there a way to redirect the output that a command produces into a table instead of a graph?
Example: if someone created a normal probability distribution of data with the pnorm var_name command, is there a way to redirect the data so that instead of appearing in a graph, it appears in a table?
To add to #Noobie's answer:
Different commands work in different ways. There's no better short summary.
What you can look out for includes
generate() options that produce new variables. (There is no absolute rule that the options have this name, but that or a similar name is the most common single variety.)
Options that allow saving results to new datasets.
Saved results, especially those visible after return list or ereturn list. These can be quite elaborate, e.g. saving of matrices of counts after tabulate.
More broadly, Stata commands aren't functions! One characteristic of a function, as so named in many languages or programs, is that there is a result, with special cases where the result is void or null. There clearly are statistical programs which in broad terms hinge on calling functions which have results, and what you see displayed is often a side-effect of that. Stata commands don't work like that in the sense that the results of a program can be various. In the case of commands designed just to show something, the "result" may be a display. It's worth noting that Mata, which underlies and underpins Stata, is more recognisably a C-like language, with (e.g.) many matrix extensions, which is based on functions (and much else).
Yes and no. It really depends on the command you are using. You should look at the help files first.
For instance, pnorm does not allow that. You can create the data yourself using the formula for pnorm described in the help file, where the cumulative distribution at some point is plotted against the so-called plotting position.
Other Stata commands allow you to generate the points directly. This is the case for kdensity for instance.
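To illustrate what pnorm computes under the hood (in Python rather than Stata, and with the (i - 0.5)/n plotting-position convention as an assumption, since conventions vary), the plotted data can be reproduced directly:

```python
import numpy as np
from scipy.stats import norm

def pnorm_data(x):
    """Return the data behind a standardised normal probability plot:
    the empirical plotting position of each sorted value and the standard
    normal CDF of the standardised value. The (i - 0.5)/n plotting-position
    convention is an assumption; check the pnorm help file for Stata's exact formula."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    plotting_position = (np.arange(1, n + 1) - 0.5) / n
    normal_cdf = norm.cdf((x - x.mean()) / x.std(ddof=1))
    return plotting_position, normal_cdf
```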

C++ FIR noise filter

I'm digging up some info about filtering the noise out of my IQ data samples in C++.
I have learned that this can be done by using a simple filter which calculates the average of the last few data samples and applies it to the current sample.
Do you have any further experience with this kind of filtering or do you recommend using some existing FIR filtering library?
Thanks for your comments!
Unfortunately, it is not as simple as "just get some library and it will do all the work for you"; digital filtering is a quite complicated subject.
It is easy to apply a digital filter to your data only if your measurements come at fixed time intervals (known as the "sample rate" in digital filtering). Otherwise (if the time intervals vary), it is not trivial to apply digital filters (and I suspect you might need an FFT to do it, but I might be wrong here).
Digital filters (both IIR and FIR) are interesting in that as soon as you know the coefficients, you don't really need a library; it is easy to write the filter yourself (see, for example, the first figure here: https://en.wikipedia.org/wiki/Finite_impulse_response (looks simple, right?)). It is finding the coefficients which is tricky.
As a prerequisite to finding the coefficients, you need to understand quite a lot about filters: you need to know what kind of filter you need (if it is after demodulation you'll likely need a low-pass filter, otherwise see the comment by MSalters below), you need to understand what a "corner frequency" is, and you need to work out how to map those frequencies to your samples (for example, you can say that your samples are coming once per second, or at any other rate, but this choice will affect your desired "corner frequency"). As soon as you've got this understanding of what you need in terms of digital filters, finding the coefficients is quite easy: you can do it either in MATLAB or with an online calculator; search for "digital filter calculator" in Google.
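To make the split between "finding coefficients" and "applying them" concrete, here is a short sketch. It is in Python/scipy purely for brevity (the application loop is the same sliding dot product you would write in C++), and the sample rate and corner frequency are made-up values you would replace with your own:

```python
import numpy as np
from scipy.signal import firwin

# Finding the coefficients: a 51-tap low-pass FIR filter.
# fs and cutoff are illustrative values, not recommendations.
fs = 1000.0      # sample rate in Hz (how often your samples arrive)
cutoff = 50.0    # corner frequency in Hz
taps = firwin(51, cutoff, fs=fs)

# Applying the filter: a sliding dot product over the last len(taps) samples.
# This loop is the part that translates almost line-for-line into C++.
def fir_apply(x, taps):
    out = np.zeros(len(x))
    for n in range(len(taps) - 1, len(x)):
        window = x[n - len(taps) + 1 : n + 1][::-1]   # x[n], x[n-1], ..., x[n-M+1]
        out[n] = np.dot(taps, window)
    return out
```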

Data mining with Weka

I am learning how to do data mining and I am using this data set from UCI's website.
http://archive.ics.uci.edu/ml/datasets/Forest+Fires
The problem I am encountering is how to deal with the area class. My understanding from the description is that I need to apply ln(x+1) to area using AddExpression.
Am I going in the correct direction with this? Or are there other filters I should investigate? Thank you.
I'll try to answer your question based on the little information you provide. I haven't worked with the forest-fires data set, but by inspection I see that the target attribute "area" often has the value 0. You probably can't simply filter out the rows with area = 0; your dataset might become too small, or the like.
I think you are asked to perform regression of some attribute(s) against log(area) in order to linearize it. However, when you try to calculate the log of the area, values such as log(0) are a problem; values between 0 and 1 might also be problematic.
So a common fix is to add 1 to the value of "Area". This introduces a systematic error, but it is small, and it removes all 0-values, and you can still derive useful models from your log(x+1)-transformed dataset.
And yes, in Weka you do this under Preprocess with the AddExpression filter, applying log(x+1) to the area attribute. This creates a new attribute; you can then remove the old area attribute.
Of course, in interpreting your model, you should be aware of the transformation. If you just want to find out what the significant independent attributes are in your linear regression model, I'd say the transformation does not matter. The data points are just shifted a little bit.
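Just to illustrate the arithmetic (a Python/numpy snippet with made-up area values, not part of Weka):

```python
import numpy as np

area = np.array([0.0, 0.36, 4.61, 278.53])   # illustrative burned-area values, including 0
# np.log(area) would give -inf for the zeros; adding 1 first keeps everything
# finite and maps 0 -> 0, which is exactly what the log(x+1) transform achieves.
log_area = np.log1p(area)                    # log1p(x) == log(x + 1)
```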

How to normalize sequence of numbers?

I am working on a user behavior project. Based on user interaction I have got some data. There is a nice sequence which smoothly increases and decreases over time, but there are little discrepancies, which are very bad. Please refer to the graph below:
You can also find data here:
2.0789 2.09604 2.11472 2.13414 2.15609 2.17776 2.2021 2.22722 2.25019 2.27304 2.29724 2.31991 2.34285 2.36569 2.38682 2.40634 2.42068 2.43947 2.45099 2.46564 2.48385 2.49747 2.49031 2.51458 2.5149 2.52632 2.54689 2.56077 2.57821 2.57877 2.59104 2.57625 2.55987 2.5694 2.56244 2.56599 2.54696 2.52479 2.50345 2.48306 2.50934 2.4512 2.43586 2.40664 2.38721 2.3816 2.36415 2.33408 2.31225 2.28801 2.26583 2.24054 2.2135 2.19678 2.16366 2.13945 2.11102 2.08389 2.05533 2.02899 2.00373 1.9752 1.94862 1.91982 1.89125 1.86307 1.83539 1.80641 1.77946 1.75333 1.72765 1.70417 1.68106 1.65971 1.64032 1.62386 1.6034 1.5829 1.56022 1.54167 1.53141 1.52329 1.51128 1.52125 1.51127 1.50753 1.51494 1.51777 1.55563 1.56948 1.57866 1.60095 1.61939 1.64399 1.67643 1.70784 1.74259 1.7815 1.81939 1.84942 1.87731
1.89895 1.91676 1.92987
I want to smooth out this sequence. The technique should be able to eliminate points with the characteristics of X and Y (marked in the graph), i.e. errors in an otherwise monotonically increasing or decreasing run.
If it cannot eliminate them, the technique should be able to shift them so that the series is not affected by the errors.
What I have tried and failed:
I tried testing the difference between consecutive values. In some special cases it works, but for a sequence like the one presented here, the distances between numbers are not such that I can cut out the errors.
I tried applying a threshold, some value X, so that a change is only accepted if it exceeds X; otherwise the point is mapped to the previous point. Here I have great trouble deciding on the value of X: because this is based on user interaction, I am not really in control of it. If the user interaction is such that its plot would be a zigzag pattern, I end up in a 'no user movement detected at all' situation.
Please share the techniques that you are aware of.
PS: The data made available in this example is a particular case. There is no typical pattern in which the numbers are going to occur, but we expect some range to be continuous, as in all the examples. The solution I am seeking is generic.
I do not know how much effort you want to put into this problem, but if you want theoretical guarantees, topological persistence seems well suited to it, imho.
Basically, with that method you can filter local maxima/minima by fixing a scale, and there are theoretical proofs saying that if your sampling is close to the underlying function, then persistence extracts the correct number of maxima.
You can look at these slides (mainly pages 7-9) to get an idea of the method.
Basically, if you take your points as a landscape and imagine a water level starting at the maximum height and decreasing, you get some peaks.
Every peak has a time when it is born, which is when it first emerges from the water, and a time when it dies, which is when it merges with a higher peak. A persistence diagram then plots a point for every peak whose x/y coordinates are its times of birth and death (by convention the first peak never dies and is not shown).
If a peak is a global maximum, it will be further from the diagonal in the persistence diagram than a local-maximum peak. To remove local maxima, you remove the peaks close to the diagonal. There are four local maxima in your example, as you can see in the persistence diagram of your data (thanks for providing the data, btw), and two global ones (the first peak is not pictured in a persistence diagram):
If you add noise to your data like this:
You will still get a very decent persistence diagram that will allow you to filter the local maxima as you want:
Please ask if you want more details or references.
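If it helps, the 1-D case is small enough to code directly. Below is a minimal Python sketch of peak persistence via a union-find sweep from the highest value down; the function name and the final threshold are my own choices, and for anything serious a dedicated topological-data-analysis library would be preferable:

```python
def peak_persistence(seq):
    """Return (persistence, index) pairs for the local maxima of a 1-D sequence.
    Sweep from the highest value down, growing components with union-find;
    a peak 'dies' (its persistence is fixed) when its component merges into a
    component born from a higher peak."""
    n = len(seq)
    order = sorted(range(n), key=lambda i: seq[i], reverse=True)
    parent = [None] * n          # None = index not yet above the water level
    birth = {}                   # component root -> height at which the peak was born
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        parent[i] = i
        birth[i] = seq[i]
        roots = sorted({find(j) for j in (i - 1, i + 1)
                        if 0 <= j < n and parent[j] is not None},
                       key=lambda r: birth[r], reverse=True)
        if not roots:
            continue                               # a new peak is born at i
        survivor = roots[0]                        # the highest neighbouring peak survives
        for r in roots[1:]:
            pairs.append((birth[r] - seq[i], r))   # the lower peak dies at this merge
        for r in roots[1:] + [i]:
            parent[r] = survivor
    pairs.append((float("inf"), order[0]))         # the global maximum never dies
    return sorted(pairs, reverse=True)

# Keep only peaks whose persistence exceeds a chosen scale, e.g. 0.05:
# significant_peaks = [idx for pers, idx in peak_persistence(data) if pers > 0.05]
```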
Since you cannot decide on a cut-off frequency, or even on the filter you want to use, I would implement several and let the user set the parameters.
The first thing that I thought of is a running average, and you can see that there are quite a few things to set in order to get different outputs.
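A running average needs only a window length; here is a quick numpy sketch (the window size of 5 is an arbitrary example the user would tune):

```python
import numpy as np

def running_average(x, window=5):
    """Replace each value with the mean of a `window`-sample neighbourhood.
    Mode 'same' keeps the output the same length as the input, but the ends
    are convolved against implicit zero padding, so expect some bias there."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")
```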

Weka: Classifier and ReplaceMissingValues

I am relatively new to the data mining area and have been experimenting with Weka.
I have a dataset which consists of almost 8000 records related to customers and items they have purchased. 58% of this data set has missing values for the "Gender" attribute.
I want to find the missing gender values based on the other data I do have.
I first thought I could do this using a classifier algorithm in Weka, using a training set to build a model. Based on examples I saw online, I tried this with pretty much all the algorithms available in Weka, using a training set consisting of 60-80% of the data that did not have missing values. This gave me a lower accuracy rate than I wanted (80-86%, depending on the algorithm used).
Did I go about this correctly? Is there a way to improve this accuracy? I experimented with using different attributes, different pre-processing of the data etc.
I also tried using the ReplaceMissingValues filter on the complete dataset to see how it would handle the missing values. However, it just changed all the missing values to "Female", which obviously cannot be the case. So I'm also wondering whether I need to use this filter in my situation or not.
It sounds like you went about it in the correct way. The ReplaceMissingValues filter replaces the missing values with the most frequent of the non-missing values, I think, so it is not what you want in this case.
A better way to get an idea of the true accuracy of your gender predictor would be to use cross-validation instead of the training/test split (Weka has a separate option for that). 80-86% may seem low, but keep in mind that random guessing would only get you about 50%, so it is still a lot better than that. To try to get better performance, pick a classifier that performs well and then play with its parameters until you get better results. This is likely to be quite labour-intensive (although you could of course use automated methods for tuning; see e.g. Auto-WEKA), but it is the only way to improve the performance.
You can also combine the algorithm you choose with a separate feature selection step (Weka has a special meta-classifier for this). This may improve performance, but again you'll have to experiment to find the particular configuration that works for you.
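For comparison only, if you ever move the same workflow out of the Weka GUI, the combination of feature selection, a classifier, and cross-validation looks roughly like this in Python/scikit-learn (the file name, column names, and choice of classifier are all placeholders, not what Weka uses internally):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# "purchases.csv" and the column names below are placeholders for illustration.
data = pd.read_csv("purchases.csv")
known = data.dropna(subset=["Gender"])                 # train only on rows with a known gender
X = pd.get_dummies(known.drop(columns=["Gender"]))     # one-hot encode categorical predictors
y = known["Gender"]

# Feature selection + classifier in one pipeline, scored with 10-fold
# cross-validation, which is less optimistic than a single train/test split.
model = make_pipeline(SelectKBest(f_classif, k=10),
                      RandomForestClassifier(n_estimators=200))
print(cross_val_score(model, X, y, cv=10).mean())
```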