How to normalize a sequence of numbers? - c++

I am working on a user behavior project. Based on user interaction I have collected some data. There is a nice sequence which smoothly increases and decreases over time, but there are small discrepancies, which are very bad. Please refer to the graph below:
You can also find data here:
2.0789 2.09604 2.11472 2.13414 2.15609 2.17776 2.2021 2.22722 2.25019 2.27304 2.29724 2.31991 2.34285 2.36569 2.38682 2.40634 2.42068 2.43947 2.45099 2.46564 2.48385 2.49747 2.49031 2.51458 2.5149 2.52632 2.54689 2.56077 2.57821 2.57877 2.59104 2.57625 2.55987 2.5694 2.56244 2.56599 2.54696 2.52479 2.50345 2.48306 2.50934 2.4512 2.43586 2.40664 2.38721 2.3816 2.36415 2.33408 2.31225 2.28801 2.26583 2.24054 2.2135 2.19678 2.16366 2.13945 2.11102 2.08389 2.05533 2.02899 2.00373 1.9752 1.94862 1.91982 1.89125 1.86307 1.83539 1.80641 1.77946 1.75333 1.72765 1.70417 1.68106 1.65971 1.64032 1.62386 1.6034 1.5829 1.56022 1.54167 1.53141 1.52329 1.51128 1.52125 1.51127 1.50753 1.51494 1.51777 1.55563 1.56948 1.57866 1.60095 1.61939 1.64399 1.67643 1.70784 1.74259 1.7815 1.81939 1.84942 1.87731
1.89895 1.91676 1.92987
I want to smooth out this sequence. The technique should be able to eliminate points like X and Y, i.e. errors within an otherwise monotonically increasing or decreasing run.
If it cannot eliminate them, the technique should be able to shift them so that the series is not affected by the errors.
What I have tried and failed:
I tried thresholding the difference between consecutive values. In some special cases it works, but for a sequence like the one presented here the distances between the numbers are not such that I can cut out the errors.
I tried applying a cut-off X: only if a change is bigger than X is it accepted, otherwise the point is mapped to the previous point. Here I have great trouble deciding on the value of X, because this is based on user interaction and I do not really control it. If the user interaction is such that its plot would be a zigzag pattern, I end up in a 'no user movement detected at all' situation.
Please share the techniques that you are aware of.
PS: The data made available in this example is a particular case. There is no typical pattern in which the numbers are going to occur, but we expect some range to be continuous in all the examples. The solution I am seeking is generic.

I do not know how much effort you want to invest in this problem, but if you want theoretical guarantees, topological persistence seems well adapted to it, imho.
Basically, with that method you can filter local maxima/minima by fixing a scale, and there are theoretical proofs saying that, if your sampling is close to your function, persistence extracts the correct number of maxima.
You can look at these slides (mainly pages 7-9) to get an idea of the method.
Basically, if you take your points as a landscape and imagine a water level starting at the maximum height and decreasing, peaks emerge one by one.
Every peak has a time when it is born, which is when it emerges above the water, and a time when it dies, which is when it merges with a higher peak. A persistence diagram then plots one point per peak, whose x/y coordinates are its times of birth and death (by convention the first peak never dies and is not shown).
If a peak is a global maximum, it will sit further from the diagonal of the persistence diagram than a local-maximum peak. To remove local maxima you therefore remove the peaks close to the diagonal. There are four local maxima in your example, as you can see in the persistence diagram of your data (thanks for providing the data, btw), and two global ones (the first peak is not pictured in a persistence diagram):
If you add noise to your data like this:
You will still get a very decent persistence diagram that will let you filter out the local maxima as you want:
Please ask if you want more details or references.
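If you want something concrete to start from: for a 1D sequence, the persistence of the maxima can be computed with a single union-find pass over the points sorted by height. Below is a minimal C++ sketch of that idea (my own illustration, not code from the slides); the persistence threshold you would then use to discard local maxima is a scale you have to choose yourself.

#include <algorithm>
#include <cstdio>
#include <vector>

// One entry per peak: the index where it appears, the height at which it is
// born, and the height at which it merges into a higher peak (death < 0 means
// the peak never dies). Persistence = birth - death.
struct Peak { int index; double birth; double death; };

std::vector<Peak> persistentMaxima(const std::vector<double>& y) {
    const int n = static_cast<int>(y.size());
    std::vector<int> order(n), comp(n, -1);   // comp[i] == -1 means "not yet processed"
    for (int i = 0; i < n; ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](int a, int b) { return y[a] > y[b]; });

    std::vector<Peak> peaks;
    std::vector<int> peakOf(n, -1);           // peakOf[i]: position in 'peaks' if i started a peak
    auto root = [&](int i) { while (comp[i] != i) i = comp[i] = comp[comp[i]]; return i; };

    for (int idx : order) {                   // sweep the water level downwards
        bool left  = idx > 0     && comp[idx - 1] != -1;
        bool right = idx < n - 1 && comp[idx + 1] != -1;
        if (!left && !right) {                // an isolated point emerges: a new peak is born
            comp[idx] = idx;
            peakOf[idx] = static_cast<int>(peaks.size());
            peaks.push_back({idx, y[idx], -1.0});
        } else if (left != right) {           // extend the single neighbouring component
            comp[idx] = root(left ? idx - 1 : idx + 1);
        } else {                              // two components meet: the lower peak dies here
            int l = root(idx - 1), r = root(idx + 1);
            int lower = (y[l] < y[r]) ? l : r;
            peaks[peakOf[lower]].death = y[idx];
            comp[idx] = comp[lower] = (lower == l ? r : l);
        }
    }
    return peaks;
}

int main() {
    // Replace with the full sequence from the question.
    std::vector<double> data = {2.08, 2.49, 2.51, 2.50, 2.59, 1.51, 1.52, 1.93};
    for (const Peak& p : persistentMaxima(data)) {
        if (p.death < 0) std::printf("peak at %d, infinite persistence\n", p.index);
        else             std::printf("peak at %d, persistence %.4f\n", p.index, p.birth - p.death);
    }
}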

Since you cannot decide on a cut-off frequency, nor even on the filter you want to use, I would implement several and let the user set the parameters.
The first thing that comes to mind is a running average, and you can see there are already many things to set in order to get different outputs.
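For illustration, here is a minimal sketch of a centered running average in C++; the window size is exactly the kind of parameter I would expose to the user so they can tune the amount of smoothing.

#include <algorithm>
#include <cstddef>
#include <vector>

// Centered moving average: each output sample is the mean of the input samples
// in a window of roughly 'window' points around it (clipped at the ends).
std::vector<double> runningAverage(const std::vector<double>& x, std::size_t window) {
    std::vector<double> out(x.size(), 0.0);
    const std::size_t half = window / 2;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const std::size_t lo = (i > half) ? i - half : 0;
        const std::size_t hi = std::min(x.size() - 1, i + half);
        double sum = 0.0;
        for (std::size_t j = lo; j <= hi; ++j) sum += x[j];
        out[i] = sum / static_cast<double>(hi - lo + 1);
    }
    return out;
}

Calling it with window sizes of, say, 3, 5 and 9 on the data above gives progressively smoother curves; too large a window will also flatten the real peaks, which is why it should stay user-tunable.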

Related

How to detect and delete noise in rapidminer?

I am new to RapidMiner 5. I just want to know how to find noise in my data, show it in a chart, and then delete it.
This is a complex problem because it depends on what you mean by noise.
If you mean finding individual attributes whose values are plain wrong then you could plot a histogram view and work out some sort of limits on what constitutes a valid value. You could then impose that rule by using Filter Examples to remove them.
If you mean finding attributes that have some sort of random jitter applied to them it would be difficult to detect these. Only by knowing beforehand what the expected shape of the distribution is could you compare with observation and do something about it. However, the action to take is by no means obvious.
If you mean finding examples within an example set that are obviously different from other examples then you could consider using the various outlier functions. The simplest one to get started is Detect Outlier (Distances). This finds a set number of outliers (default 10) based on a distance calculation that uses all the attributes for examples. It creates a new attribute called outlier that is set to true or false. You could then use the Filter Examples operator to remove those that are set to true.
Hope that helps at least as a start.

Neural Network gives same output for different inputs, doesn't learn

I have a neural network written in standard C++11 which I believe follows the back-propagation algorithm correctly (based on this). If I output the error in each step of the algorithm, however, it seems to oscillate without dampening over time. I've tried removing momentum entirely and choosing a very small learning rate (0.02), but it still oscillates at roughly the same amplitude per network (with each network having a different amplitude within a certain range).
Further, all inputs result in the same output (a problem I found posted here before, although for a different language. The author also mentions that he never got it working.)
The code can be found here.
To summarize how I have implemented the network:
Neurons hold the current weights to the neurons ahead of them, previous changes to those weights, and the sum of all inputs.
Neurons can have their value (sum of all inputs) accessed, or can output the result of passing said value through a given activation function.
NeuronLayers act as Neuron containers and set up the actual connections to the next layer.
NeuronLayers can send the actual outputs to the next layer (instead of pulling from the previous).
FFNeuralNetworks act as containers for NeuronLayers and manage forward-propagation, error calculation, and back-propagation. They can also simply process inputs.
The input layer of an FFNeuralNetwork sends its weighted values (value * weight) to the next layer. Each neuron in every subsequent layer outputs the weighted result of the activation function, unless it is a bias or the layer is the output layer (biases output the weighted value; the output layer simply passes the sum through the activation function).
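For context, here is a generic sketch of the kind of forward step described above (illustrative names only, not the actual code linked above):

#include <cmath>
#include <cstddef>
#include <vector>

// Generic illustration: one forward step from a layer of activations to the
// next, with weights[j][i] being the weight from input neuron i to output
// neuron j, and a sigmoid activation applied to the weighted sum.
double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

std::vector<double> forward(const std::vector<double>& inputs,
                            const std::vector<std::vector<double>>& weights) {
    std::vector<double> outputs(weights.size(), 0.0);
    for (std::size_t j = 0; j < weights.size(); ++j) {
        double sum = 0.0;
        for (std::size_t i = 0; i < inputs.size(); ++i)
            sum += inputs[i] * weights[j][i];
        outputs[j] = sigmoid(sum);
    }
    return outputs;
}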
Have I made a fundamental mistake in the implementation (a misunderstanding of the theory), or is there some simple bug I haven't found yet? If it would be a bug, where might it be?
Why might the error oscillate by the amount it does (around +-(0.2 +- learning rate)) even with a very low learning rate? Why might all the outputs be the same, no matter the input?
I've gone over most of it so much that I might be skipping over something, but I think I may have a plain misunderstanding of the theory.
It turns out I was just staring at the FFNeuralNetwork parts too much and accidentally used the wrong input set to confirm the correctness of the network. It actually does work correctly with the right learning rate, momentum, and number of iterations.
Specifically, in main, I was using the array inputs instead of the smaller array in when testing the outputs of the network.

Best way to model music (notes) for fast searching notes at a particular time

I'm working on an iOS music app (written in C++) and my model looks more or less like this:
--Song
----Track
----Track
------Pattern
------Pattern
--------Note
--------Note
--------Note
So basically a Song has multiple Tracks, a Track can have multiple Patterns and a Pattern has multiple Notes. Each one of those things is represented by a class and except for the Song object, they're all stored inside vectors.
Each Note has a "frame" parameter so that I can calculate when a note should be played. For example, at 44100 samples/second, a note whose frame is 132300 falls exactly three seconds into the song.
My question is how I should represent those notes for best performance. Right now I'm thinking of storing the notes in a vector data member of each Pattern, then looping over all the Tracks of the Song, then over their Patterns, and then over the Notes to see which ones have a frame data member greater than 132300 and smaller than 176400 (i.e. between three and four seconds).
As you can tell, that's a lot of loops and a song could be as long as 10 minutes. So I'm wondering if this will be fast enough to calculate all the frames and send them to the buffer on time.
One thing you should remember is that to improve performance, normally memory consumption would have to increase. It is also relevant (and justified) in this case, because I believe you want to store the same data twice, in different ways.
First of all, you should have this basic structure for a song:
map<Track, vector<Pattern>> tracks;
It maps each Track to a vector of Patterns. Map is fine, because you don't care about the order of tracks.
Traversing through Tracks and Patterns should be fast, as there will not be many of them (I assume). The main performance concern is looping through thousands of notes. Here's how I suggest solving it:
First of all, for each Pattern object you should have a vector<Note> as your main data storage. You will write all the changes on the Pattern's contents to this vector<Note> first.
vector<Note> notes;
And for performance considerations, you can have a second way of storing notes:
map<int, vector<Note>> measures;
This one maps each measure (by its number) in a Pattern to the vector of Notes contained in that measure. Every time the data changes in the main notes storage, you apply the same changes to the data in measures. You could also do it only once before playback, or even during playback, in a separate thread.
Of course, you could store the notes only in measures, without having to sync two sources of data. But that may not be so convenient to work with when you have to apply mass operations to whole bunches of notes.
During the playback, before the next measure starts, the following algorithm would happen (roughly):
In every track, find all patterns for which pattern->startTime <= [current playback second] <= pattern->endTime.
For each such pattern, calculate the current measure number and get the vector<Note> for the corresponding measure from the measures map.
Now, until the next measure (second?) starts, you only have to loop through the current measure's notes.
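A rough C++ sketch of that lookup (Track, Pattern and Note stand in for your classes; startTime, endTime and the measures member are the assumptions made in this answer, and I keep the patterns inside each Track here just to keep the sketch short):

#include <map>
#include <vector>

struct Note    { long frame; };
struct Pattern {
    double startTime, endTime;                  // playback window, in seconds
    std::map<int, std::vector<Note>> measures;  // measure number -> notes in that measure
};
struct Track   { std::vector<Pattern> patterns; };

// Collect the notes of a given measure from every pattern that is active
// at the current playback position.
std::vector<const Note*> notesForMeasure(const std::vector<Track>& tracks,
                                         double now, int measureNumber) {
    std::vector<const Note*> result;
    for (const Track& t : tracks)
        for (const Pattern& p : t.patterns)
            if (p.startTime <= now && now <= p.endTime) {
                auto it = p.measures.find(measureNumber);
                if (it != p.measures.end())
                    for (const Note& n : it->second)
                        result.push_back(&n);
            }
    return result;
}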
Just keep those vectors sorted.
During playback, you can just keep a pointer (index) into each vector for the last note played. To look for new notes, you only have to check the following note in each vector; no looping through all the notes is required.
Keep your vectors sorted, and try things out - that is more important than any answer you can receive here.
For all of your questions you should seek to answer them with tests and prototypes; then you will know whether you even have a problem. Also, while trying things out, you will see things that you wouldn't normally see from the theory alone.
and my model looks more or less like this:
Several critically important concepts are missing from your model:
Tempo.
Dynamics.
Pedal.
Instrument.
Time signature.
(Optional) Tonality.
Effect (Reverberation/chorus, pitch wheel).
Stereo positioning.
Lyrics.
Chord maps.
Composer information/Title.
Each Note has a "frame" parameter so that I can calculate when a note should be played.
Several critically important concepts are missing from your model:
Articulation.
Aftertouch.
Note duration.
I'd advise taking a look at LilyPond. It is typesetting software, but it is also one of the most precise ways to represent music in a human-readable text format.
My question is how I should represent those notes for best performance?
Put them all into a std::map<Timestamp, Note> and find the segment you want to play using lower_bound/upper_bound. Alternatively, you could binary-search them in a flat std::vector, as long as the data is sorted.
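A minimal sketch of that lookup (Note and Frame are placeholders; I use std::multimap rather than std::map because several notes can start on the same frame):

#include <cstdint>
#include <map>

struct Note { int pitch; int velocity; /* duration, ... */ };
using Frame = std::int64_t;

// All notes of the song, keyed by the frame on which they start.
std::multimap<Frame, Note> notesByFrame;

// Visit every note starting in the half-open frame range [from, to),
// e.g. [132300, 176400) for one second of audio at 44100 Hz.
void forEachNoteInRange(Frame from, Frame to) {
    auto first = notesByFrame.lower_bound(from);
    auto last  = notesByFrame.lower_bound(to);
    for (auto it = first; it != last; ++it) {
        // schedule it->second for playback at frame it->first
    }
}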
Unless you want to make a "beeper", writing a music application is much more difficult than you think. I'd strongly recommend trying another project.

Face Recognition Using Backpropagation Neural Network?

I'm very new to image processing, and my first assignment is to make a working program which can recognize faces and their names.
So far, I have successfully made a project that detects a face, crops the detected image, applies a Sobel filter, and translates the result into an array of floats.
But I'm very confused about how to implement a backpropagation MLP to learn the image so it can recognize the correct name for the detected face.
It would be a great honor if the experts on Stack Overflow could give me some examples of how the image array can be learned with backpropagation.
This is a standard machine learning setup. You have a number of arrays of floats (instances in ML terms, or observations in statistics terms) and corresponding names (labels, class tags), one per array. This is enough for use in most ML algorithms. Specifically in an ANN, the elements of your array (i.e. the features) are the inputs of the network and the labels (names) are its outputs.
If you are looking for theoretical description of backpropagation, take a look at Stanford's ml-class lectures (ANN section). If you need ready implementation, read this question.
You haven't specified what the elements of your arrays are. If you use just the pixels of the original image, this will work, but not very well. If you need a production-level system (though still based on an ANN), try to extract higher-level features (e.g. the Haar-like features that OpenCV itself uses).
Have you tried writing your feature vectors to an ARFF file and feeding them to Weka, just to see if your approach might work at all?
Weka has a lot of classifiers integrated, including MLP.
From what I understand so far, I suspect that the features and the classifier you have chosen will not work.
To your original question: have you made any attempts to implement a neural network on your own? If so, where did you get stuck? Note that this is not the place to request a complete working implementation from the audience.
To provide a general answer on a general question:
Usually you have nodes in an MLP: input nodes, output nodes, and hidden nodes. These nodes are strictly organized in layers, with the input layer at the bottom, the output layer at the top, and hidden layers in between. The nodes are connected in a simple feed-forward fashion (output connections are allowed only to the next higher layer).
Then you connect each of your floats to a single input node and feed the feature vectors to your network. For backpropagation you need to supply an error signal for the output nodes. So if you have n names to distinguish, you may use n output nodes (i.e. one for each name) and make them, for example, return 1 in case of a match and 0 otherwise. You could equally well use one output node and let it return n different values for the names. It would probably even be best to use n completely separate perceptrons, i.e. one per name, to avoid certain side effects (catastrophic interference).
Note that the output of each node is a number, not a name. Therefore you need some sort of threshold to get from the numbers back to a name.
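A minimal sketch of such a decision step, assuming one output node per name (outputs and names are placeholders):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Map the n output activations back to a name by picking the strongest
// response; you could additionally require it to exceed some minimum value
// before accepting the match.
std::string predictedName(const std::vector<double>& outputs,
                          const std::vector<std::string>& names) {
    auto best = std::max_element(outputs.begin(), outputs.end());
    return names[static_cast<std::size_t>(best - outputs.begin())];
}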
Also note that you need a lot of training data to train a large network (i.e. to cope with the curse of dimensionality). It would be interesting to know the size of your float array.
Indeed, for a complex decision you may need a larger number of hidden nodes or even hidden layers.
Further note that you may need to do a lot of evaluation (i.e. cross-validation) to find the optimal configuration (number of layers, number of nodes per layer), or even to find any working configuration at all.
Good luck, anyway!

How to unit test complex methods

I have a method that, given an angle for North and an angle for a bearing, returns a compass point value from 8 possible values (North, NorthEast, East, etc.). I want to create a unit test that gives decent coverage of this method, providing different values for North and Bearing to ensure I have adequate coverage to give me confidence that my method is working.
My original attempt generated all possible whole number values for North from -360 to 360 and tested each Bearing value from -360 to 360. However, my test code ended up being another implementation of the code I was testing. This left me wondering what the best test would be for this such that my test code isn't just going to contain the same errors that my production code might.
My current solution is to spend time writing an XML file with data points and expected results, which I can read in during the test and use to validate the method but this seems exceedingly time consuming. I don't want to write a file that contains the same range of values that my original test contained (that would be a lot of XML) but I do want to include enough to adequately test the method.
How do I test a method without just reimplementing the method?
How do I achieve adequate coverage to have confidence in the method I am testing without having to have test points for all possible inputs and results?
Obviously, don't dwell too much on my specific example as this applies to many situations where there are complex calculations and ranges of data to be tested.
NOTE: I am using Visual Studio and C#, but I believe this question is language-agnostic.
First off, you're right, you do not want your test code to reproduce the same calculation as the code under test. Secondly, your second approach is a step in the right direction. Your tests should contain a specific set of inputs with the pre-computed expected output values for those inputs.
Your XML file should contain just a subset of the input data that you've described. Your tests should ensure that you can handle the extreme ranges of your input domain (-360, 360), a few data points just inside the ends of the range, and a few data points in the middle. Your tests should also check that your code fails gracefully when given values outside the input range (e.g. -361 and +361).
Finally, in your specific case, you may want to have a few more edge cases to make sure that your function correctly handles "switchover" points within your valid input range. These would be the points in your input data where the output is expected to switch from "North" to "Northwest" and from "Northwest" to "West", etc. (don't run your code to find these points, compute them by hand).
Just concentrating on these edge cases and a few cases in between the edges should greatly reduce the amount of points you have to test.
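As an illustration of that kind of table-driven test (sketched in C++ for brevity, though the question uses C#; compassPoint is a hypothetical signature for the method under test, and the expected column holds values you would compute by hand for your own convention):

#include <cassert>
#include <string>
#include <vector>

// Hypothetical signature for the method under test.
std::string compassPoint(double north, double bearing);

// Each row pairs an input with its hand-computed answer, covering the range
// extremes, points just inside them, and the switchover angles.
void testCompassPoint() {
    struct Case { double north, bearing; const char* expected; };
    const std::vector<Case> cases = {
        {   0.0,   0.0, "North"     },
        {   0.0,  45.0, "NorthEast" },
        {-360.0,  90.0, "East"      },  // lower range extreme
        { 360.0, -90.0, "West"      },  // upper range extreme
        {   0.0,  22.5, "NorthEast" },  // a hypothetical switchover point
    };
    for (const Case& c : cases)
        assert(compassPoint(c.north, c.bearing) == c.expected);
}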
You could possibly re-factor the method into parts that are easier to unit test and write the unit tests for the parts. Then the unit tests for the whole method only need to concentrate on integration issues.
I prefer to do the following.
Create a spreadsheet with right answers. However complex it needs to be is irrelevant. You just need some columns with the case and some columns with the expected results.
For your example, this can be big. But big is okay. You'll have an angle, a bearing and the resulting compass point value. You may have a bunch of intermediate results.
Create a small program that reads the spreadsheet and writes the simplified, bottom-line unittest cases. You want your cases stripped down to
def testCase215n( self ):
    self.fixture.setCourse( 215 )
    self.fixture.setBearing( 45 )
    self.fixture.calculate()
    self.assertEquals( "N", self.fixture.compass() )
[That's Python, the same idea would hold for C#.]
The spreadsheet contains the one-and-only authoritative list of right answers. You generate code from this once or twice. Unless, of course, you find an error in your spreadsheet version and have to fix that.
I use a small Python program with xlrd and the Mako template generator to do this. You could do something similar with C# products.
If you can think of a completely different implementation of your method, with completely different places for bugs to hide, you could test against that. I often do things like this when I've got an efficient, but complex implementation of something that could be implemented much more simply but inefficiently. For example, if writing a hash table implementation, I might implement a linear search-based associative array to test it against, and then test using lots of randomly generated input. The linear search AA is very hard to screw up and even harder to screw up such that it's wrong in the same way as the hash table. Therefore, if the hash table has the same observable behavior as the linear search AA, I'd be pretty confident it's correct.
Other examples would include writing a bubble sort to test a heap sort against, or using a known working sort function to find medians and comparing that to the results of an O(N) median finding algorithm implementation.
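A sketch of that cross-check applied to the compass example (dirFunctionFast and dirFunctionNaive are hypothetical names for your real implementation and the deliberately simple reference one):

#include <cassert>
#include <cstdlib>

// Hypothetical: the optimised implementation under test and a simple,
// hard-to-get-wrong reference implementation of the same requirement.
int dirFunctionFast(int north, int bearing);
int dirFunctionNaive(int north, int bearing);

// Compare the two on random inputs; any disagreement points at a bug in
// (at least) one of them.
void fuzzCompare(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        const int north   = std::rand() % 721 - 360;  // -360 .. 360
        const int bearing = std::rand() % 721 - 360;
        assert(dirFunctionFast(north, bearing) == dirFunctionNaive(north, bearing));
    }
}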
I believe that your solution is fine, despite using an XML file (I would have used a plain text file). But a more common tactic is to just test the limit situations, like using, in your case, entry values of -360, 360, -361, 361 and 0.
You could try orthogonal array testing to achieve all-pairs coverage instead of all possible combinations. This is a statistical technique based on the theory that most bugs occur due to interactions between pairs of parameters. It can drastically reduce the number of test cases you write.
Not sure how complicated your code is; if it takes an integer in and divides it up into 8 or 16 directions on the compass, it is probably only a few lines of code, yes?
You are going to have a hard time not re-writing your code to test it, depending on how you test it. Ideally you would want an independent party to write the test code based on the same requirements, without looking at or borrowing your code. This is unlikely to happen in most situations, and in this case it may be overkill.
In this specific case I would feed it each number in order from -360 to +360, and print the number and the result (to a text file, in a format that can be compiled into another program as a header file). Visually inspect that the direction changes at the desired inputs; this should be easy to check. Now you have a table of inputs and valid outputs. Next, have a program randomly select from the valid inputs, feed them into the code under test, and check that the right answer comes out. Do a few hundred of these random tests. At some point you also need to validate that numbers less than -360 or greater than +360 are handled per your requirements, either by clipping or by wrapping, I assume.
So I took a software testing class, and basically what you want is to identify the classes of inputs: all real numbers? all integers? only positive, only negative, etc.? Then group the output actions: is 360 meaningfully different from 359, or do they pretty much end up doing the same thing in the app? Once there, pick a combination of inputs that covers the outputs.
This all seems abstract and vague but until you provide the method code it's difficult to come up with a perfect strategy.
Another way is to do branch-level testing or predicate-coverage testing. Code coverage isn't foolproof, but not covering all your code seems irresponsible.
One approach, probably one to apply in combination with other method of testing, is to see if you can make a function that reverses the method you are testing. In this case, it would take a compass direction(northeast, say), and output a bearing (given the bearing for north). Then, you could test the method by applying it to a series of inputs, then applying the function to reverse the method, and seeing if you get back the original input.
There are complications, particularly if one output corresponds to multiple inputs, but in those cases it may be possible to generate the set of inputs corresponding to a given output and test each member of the set (or a sample of its elements).
The advantage of this approach is that it doesn't rely on you being able to simulate the method manually, or create an alternative implementation of the method. If the reversal involves a different approach to the problem to that used in the original method, it should reduce the risk of making equivalent mistakes in both.
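A sketch of such a round-trip test (bearingToCompass and compassToBearing are hypothetical names; the inverse only needs to return some representative bearing for the given point):

#include <cassert>
#include <string>

// Hypothetical pair: the method under test and a hand-written inverse.
std::string bearingToCompass(double north, double bearing);
double compassToBearing(double north, const std::string& point);

// Round-trip property: converting a compass point to a bearing and back
// should give the original point, for every point and several norths.
void testRoundTrip() {
    const std::string points[] = {"North", "NorthEast", "East", "SouthEast",
                                  "South", "SouthWest", "West", "NorthWest"};
    for (double north = -360.0; north <= 360.0; north += 45.0)
        for (const std::string& p : points)
            assert(bearingToCompass(north, compassToBearing(north, p)) == p);
}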
Pseudocode:
array colors = { red, orange, yellow, green, blue, brown, black, white }

for north = -360 to 361
    for bearing = -361 to 361
        theColor = colors[dirFunction(north, bearing)]  // dirFunction is the one being tested
        setColor(theColor)
        drawLine(centerX, centerY,
                 centerX + (cos(north + bearing) * radius),
                 centerY + (sin(north + bearing) * radius))

Verify Resulting Circle against rotated reference diagram.
When North = 0, you'll get an 8-colored pie chart. As north varies + or -, the pie chart will look the same, but rotated around that many degrees. Test verification is a simple matter of making sure that (a) the image is a properly rotated pie chart and (b) there aren't any green spots in the orange area, etc.
This technique, by the way, is a variation on the world's greatest debugging tool: Have the computer draw you a picture of what =IT= thinks it's doing. (Too often, developers waste hours chasing what they think the computer is doing, only to find that it's doing something completely different.)