Calculating the mean for a set of numbers while neglecting outliers

First of all, this is more of a math question than a coding one, so please be patient.
I am trying to figure out an algorithm to calculate the mean of a set of numbers. However, I need to neglect any numbers that are not close to the majority of the results. Here is an example of what I am trying to do:
Let's say I have a set of numbers similar to the following:
{ 90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400 }
It is clear that the majority of the numbers in the set above lie between 90 and 99; however, there are some outliers: { 300, 400, 2, 3 }. I need to calculate the mean of those numbers while neglecting the outliers. I remember reading about something like that in a statistics class, but I can't remember what it was or how to approach the solution.
I would appreciate any help.
Thanks

What you could do is:
estimate the percentage of outliers in your data: roughly a quarter (4 of 15, about 27%) of the provided dataset,
compute the adequate quantiles: 8-quantiles for your dataset, so as to exclude the outliers,
estimate the mean between the first and the last quantile.
PS: Outliers making up roughly a quarter of your dataset is a lot!
PPS: For the second step, we assumed outliers are "symmetrically distributed".
[Figure: 4-quantiles with 1.5 times the interquartile range (IQR) from Q1 and Q3]
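
As a rough sketch of the quartile-based variant (4-quantiles with fences at 1.5 times the IQR), the following Python uses the standard library's `statistics.quantiles`. The function name `mean_without_outliers` and the default factor `k=1.5` are choices made for this illustration, not something the answer prescribes.

```python
import statistics

def mean_without_outliers(values, k=1.5):
    """Mean of the values inside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    return statistics.mean(kept), kept

data = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]
mean, kept = mean_without_outliers(data)
print(round(mean, 2), kept)
```

On the question's data this gives Q1 = 91 and Q3 = 99, so the fences are 79 and 111; the eleven values between 90 and 99 are kept and their mean is about 93.09.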

First you need to determine the standard deviation and mean of the full set. The outliers are those values that are greater than 3 standard deviations from the (full set) mean.
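
A minimal sketch of this rule in Python, assuming the population standard deviation is meant (the answer does not say which one); `three_sigma_filter` and its `n_sigma` parameter are names invented for the example.

```python
import statistics

def three_sigma_filter(values, n_sigma=3.0):
    """Keep values within n_sigma standard deviations of the full-set mean."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [v for v in values if abs(v - mu) <= n_sigma * sd]

data = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]
kept = three_sigma_filter(data)
print(len(data), len(kept))
```

On the question's data the full-set mean is about 115.3 and the standard deviation about 98.7, so no value lies more than 3 standard deviations from the mean and nothing is removed. With this many large outliers, the outliers themselves inflate the standard deviation, which is worth checking before relying on this rule.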

A simple method that works well is to take the median instead of the average. The median is far more robust to outliers.
You could also minimize a Geman-McClure function:
$\hat{x} = \arg\min_{x} \sum_i G(x_i - x)$, where $G(r) = \dfrac{r^2}{r^2 + \sigma^2}$
If you plot the function $G$, you will find that it saturates, which is a good way of softly excluding outliers.
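
Both ideas can be sketched in a few lines of Python. The median comes straight from the standard library; for the Geman-McClure estimate, the sketch below uses iteratively reweighted averaging started at the median. The scale `sigma=5.0`, the iteration count, and the function name are assumptions for illustration, and because the objective is not convex the result is only a local minimum near the starting point.

```python
import statistics

def geman_mcclure_location(values, sigma=5.0, iterations=50):
    """Approximate argmin_x sum(G(x_i - x)) by iteratively reweighted averaging."""
    x = statistics.median(values)  # robust starting point
    for _ in range(iterations):
        # G'(r)/r is proportional to 1 / (r^2 + sigma^2)^2, so far-away
        # points receive almost no weight in the next average.
        weights = [1.0 / ((v - x) ** 2 + sigma ** 2) ** 2 for v in values]
        x = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return x

data = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]
print(statistics.median(data), round(geman_mcclure_location(data), 2))
```

The median of the question's data is 92, and the reweighted estimate stays close to it because the four outliers end up with weights near zero.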

I'd be very careful about this. You could be doing yourself and your conclusions a great disservice.
How is your program supposed to recognize outliers? The normal distribution would say that about 99.7% of the values fall within +/- three standard deviations of the mean, so you could calculate both for the unfiltered data, exclude the values that fall outside the assumed range, and recalculate.
However, you might be throwing away something significant by doing so. The normal distribution isn't sacred; outliers are far more common in real life than the normal distribution would suggest. Read Taleb's "Black Swan" to see what I mean.
Be sure you understand fully what you're excluding before you do so. I think it'd be far better to leave all the data points, warts and all, and come up with a good written explanation for them.
Another approach would be to use an alternative measure like the median, which is less sensitive to outliers than the mean. It's harder to calculate, though.

Google Sheet Not Multiplying in IF Formula

I am trying to calculate a price based on rates. If the number is $20,000 or below, there is a flat rate of $700. If the number is between $20,000.01 and $50,000, the rate is 3.5% of the number. The rates continue to drop as the numbers go up. I can get Google Sheets to populate the cell with $700 if the number is $20,000 or below, but I can't seem to make it do the multiplication for me. The cell just shows C4*.035
I want it to multiply the number shown in cell C4 by the percentage listed.
Here is the code as it currently sits:
=if(AND(C4<=20000),"700",IF(AND(C4>=20000.01,C4<=50000),"C4*.035", IF(AND(C4>=50000.01,C4<=100000),"C4*.0325", IF(AND(C4>=100000.01),"C4*.03"))))
Note, I know nothing about coding so I apologize if it is sloppy or doesn't make sense. I tried to copy and format based on an example that was semi similar to mine.

try:
=IF(C4<=20000, 700,
IF(AND(C4>=20000.01, C4<=50000), C4*0.035,
IF(AND(C4>=50000.01, C4<=100000), C4*0.0325,
IF(AND(C4>=100000.01), C4*0.03))))

As BigBen noticed in his comment, there's a mistake in your formula: you should not put quotation marks around an expression if you don't want it to be read as a string.
Actually, a cleaner solution is to use the IFS function for this task.
=ifs(C4<=20000,700,
C4<=50000,C4*0.035,
C4<=100000,C4*0.0325,
C4>100000,C4*0.03)
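
To make the tier logic explicit outside the spreadsheet, here is the same rule as a small Python function. It is only an illustration of why the lower bounds can be dropped: each check is reached only if all earlier ones failed. The function name `price` is made up for the example.

```python
def price(amount):
    """Flat fee up to 20,000; otherwise a percentage that drops per tier."""
    if amount <= 20_000:
        return 700.0
    if amount <= 50_000:       # reached only when amount > 20,000
        return amount * 0.035
    if amount <= 100_000:      # reached only when amount > 50,000
        return amount * 0.0325
    return amount * 0.03       # everything above 100,000

for amount in (15_000, 30_000, 80_000, 250_000):
    print(amount, round(price(amount), 2))
```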

Divide Large Number into Multiple Smaller Units

I am new to Stack Overflow, but I am having logic issues with Google Sheets and need some advice on how to proceed.
My goal: I have 3 cup sizes: small, medium and large. Example: Small holds 50 mL, Medium 100 mL, Large 200 mL, etc.
I want to take a large number and distribute it among the cups so that I use the fewest cups possible, and see how many of each cup I need. Example: 170 mL = 1 Large, 0 Medium, 0 Small, since I only need 1 large to hold everything. Likewise, 240 mL would give 1 Large, 0 Medium, 1 Small, since a small can hold the remaining 40 and a medium would be too big a container.
The problem is that I don't understand how to break my original large number down into the smaller numbers, since I have to check whether each remainder will fit, and I also have to be able to add more cups if there's a remainder. As far as I know, Google Sheets functions only run and produce a number once.
I've already tried breaking it down to my large container first then in my second row with medium cups I take the first result and subtract from my large number to see if anything is left. If there is, all I can do is add 1 or set the number, I can't seem to scale it up if it requires more than 1 cup which isn't what I want.
I've been going crazy trying to find an easy solution to this but it seems to get more complex as if I need IF statements of some kind.
If anyone has any ideas I'd be happy to hear them out.

You could use an IF statement, for example like this:
=ARRAYFORMULA(IF(A2:A<>"",
IF(A2:A>=B1, QUOTIENT(A2:A, B1),
IF(A2:A> C1+D1, QUOTIENT(B1, A2:A), 0)), ))
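
Independently of the spreadsheet formula, the logic the question describes can be sketched in Python under one reading of the requirements: fill as many large cups as possible, then put whatever is left into the smallest single cup that holds it. The function name, the dictionary layout, and this reading of "least amount of cups" are assumptions.

```python
def cups_needed(amount_ml, sizes=None):
    """Largest cups first, then the smallest single cup that holds the remainder."""
    if sizes is None:
        sizes = {"small": 50, "medium": 100, "large": 200}
    counts = {name: 0 for name in sizes}
    largest = max(sizes, key=sizes.get)
    counts[largest], remainder = divmod(amount_ml, sizes[largest])
    if remainder > 0:
        # Smallest cup whose capacity covers the remainder.
        fit = min((n for n in sizes if sizes[n] >= remainder), key=sizes.get)
        counts[fit] += 1
    return counts

print(cups_needed(170))  # the question's first example
print(cups_needed(240))  # the question's second example
```

With the sizes from the question, 170 mL yields one large cup and 240 mL yields one large plus one small, matching both examples given.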

QR code generation algorithm data masking implementation case analysis

I'm implementing a QR code generation algorithm as explained on thonky.com and I'm trying to understand one of the cases:
As stated on this page, after getting the percentage of dark modules out of the whole code, I should take the two nearest multiples of five (for example, 45 and 50 for 48%). But what if the percentage is itself a multiple of 5, for example 45.0? Which numbers should be taken? 45? 40 and 50? 45 and 40? 45 and 50? Something totally different? I couldn't find an answer to that anywhere...
Thank you very much in advance for the help!

Indeed the Thonky tutorial is unclear in this respect, so let's turn to the official standard (behind a paywall at ISO but easy to find online). Section 8.8.2, page 52, Table 24:
Evaluation condition: 50 ± (5 × k)% to 50 ± (5 × (k + 1))%
Points: N₄ × k
Here, N₄ = 10, and k is the rating of the deviation of the proportion of dark modules in the symbol from 50% in steps of 5%.
So for exactly 45% dark modules, you'd have k = 1, resulting in a penalty of 10 points.
Also note that it doesn't really matter if you get this slightly wrong. Because the mask pattern identifier is encoded in the format string, a reader can still decode the QR code even if you accidentally chose a slightly suboptimal mask pattern.
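
One way to read the table is that k counts the whole 5% steps by which the dark proportion deviates from 50%. A minimal Python sketch of that reading follows; the function name and the use of integer arithmetic (to avoid rounding trouble at exact multiples of 5%) are choices made for this example, not taken from the standard.

```python
def dark_ratio_penalty(dark_modules, total_modules, n4=10):
    """Penalty N4 * k, where k counts full 5% steps of deviation from 50% dark."""
    # |dark/total * 100 - 50| / 5, computed without floating point:
    deviation_scaled = abs(100 * dark_modules - 50 * total_modules)
    k = deviation_scaled // (5 * total_modules)
    return n4 * k

print(dark_ratio_penalty(48, 100))  # 48% dark
print(dark_ratio_penalty(45, 100))  # exactly 45% dark
```

Under this reading, 48% gives k = 0 and no penalty, while exactly 45% gives k = 1 and a penalty of 10, in line with the answer above.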

Achieving Mutability When Mixing Primitives and Cocoa Collections

Okay, I think I might be over-complicating this issue, but I truly am stuck. Basically, I am trying to model a weight set, specifically an Olympic weight set. So I have the bar, which is 45 lbs, then 2 weights of 2.5 lbs, 4 of 5 lbs, and then 2 each of 10, 25, 35, and 45 lbs. This makes a total of 300 lbs.
bar = 45 lbs
2 of 2.5
4 of 5
2 of 10
2 of 25
2 of 35
2 of 45
I want to model this weight set so that I have this information: the weight and the quantity of weights I have. I know I could hard-code this but I eventually want to let the user enter how many of each weight they may have.
Anyways, originally I thought I could simply have an NSDictionary with the key being the weight, such as 35, and the value being the quantity.
Naturally I cannot store primitives in an NSDictionary or other Cocoa collection, so I have to encapsulate each integer in an NSNumber. However, the point of my modeling this weight set is so that I can simulate the use of certain weights. For example, if I use a 35 lbs. weight that takes 2 off (one for each side), so I have to go and edit the value for the 35 lbs. weight to reflect the fact that I have deducted 2 from the quantity.
This involves the tedious task of unboxing the NSNumber, converting back to a primitive, doing the math, and then re-boxing into an NSNumber and assigning that new result to the appropriate location in the NSDictionary. After searching around a bit, I confirmed my initial premonition that this was not a good idea.
So I have a couple questions. First of all, is there a better way of modeling a weight set aside from using a dictionary-style solution? If not, what is the suggested way to go about doing this? Do I have to leave the cocoa-realm and resort to using some sort of C++ STL template such as a map?
I have seen some information on NSDecimalNumber, should I just use that?
Like I said, I wouldn't be surprised if I am over-complicating this. I would really appreciate any help, thanks.
EDIT: I am beginning to think that the weight set 'definition' as described should indeed be immutable, as it is a definition after all. Then when I use a certain weight, I can add to some sort of tally. The thing is that the tally will also be some form of collection whose values I will be modifying (adding to), so that I can correlate it to the specific weight. So I guess I am in the same problem.
I think where I am trying to get at is creating a 'clone' so to speak of the weight set definition which I can easily modify (to simulate the usage of individual weights).
Sorry, I'm burned out.

Storing this in a dictionary isn't a natural fit. I think the best approach would be to make a Weight class that represents the weights, and stick them in an NSCountedSet. You can get the individual kinds of Weight and the counts for each kind, and you can get the weight of the whole set with [weightSet valueForKeyPath:@"@sum.weightInPounds"] (assuming the Weights have a weightInPounds property that represents how heavy they are).
You could use NSNumbers in the NSCountedSet and sum them with @sum.integerValue if you wanted, but it seems a bit awkward to me. At any rate, NSCountedSet is definitely a more natural collection than an NSDictionary for storing, well, a counted set.
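
The counted-set idea is not specific to Cocoa. As a language-neutral illustration (Python rather than Objective-C, so none of this is Cocoa API), `collections.Counter` plays a similar role: an untouched definition of the set plus a working copy that is modified as plates are used. The names `PLATES`, `BAR_LBS`, and `total_weight` are invented for the sketch.

```python
from collections import Counter

# Weight-set definition: plate weight in pounds -> number of plates owned.
PLATES = Counter({2.5: 2, 5: 4, 10: 2, 25: 2, 35: 2, 45: 2})
BAR_LBS = 45

def total_weight(plates, bar=BAR_LBS):
    """Bar plus every plate, counted as many times as it occurs."""
    return bar + sum(weight * count for weight, count in plates.items())

available = PLATES.copy()      # working copy; the definition stays untouched
available.subtract({35: 2})    # load one 35 lb plate on each side of the bar
print(total_weight(PLATES), available[35], PLATES[35])
```

The definition still totals the 300 lbs from the question, while the working copy shows no 35 lb plates left.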

There's nothing wrong with storing your numbers in an NSDictionary! The question you referenced was referring to complicated, frequent math. Converting from NSNumber and back is slow compared to simple int addition, but is still super-fast compared to human perception. I think your dictionary idea is EDIT: not as good as Chuck's NSCountedSet idea. :)

How to unit test complex methods

I have a method that, given an angle for North and an angle for a bearing, returns a compass point value from 8 possible values (North, NorthEast, East, etc.). I want to create a unit test that gives decent coverage of this method, providing different values for North and Bearing to ensure I have adequate coverage to give me confidence that my method is working.
My original attempt generated all possible whole number values for North from -360 to 360 and tested each Bearing value from -360 to 360. However, my test code ended up being another implementation of the code I was testing. This left me wondering what the best test would be for this such that my test code isn't just going to contain the same errors that my production code might.
My current solution is to spend time writing an XML file with data points and expected results, which I can read in during the test and use to validate the method but this seems exceedingly time consuming. I don't want to write a file that contains the same range of values that my original test contained (that would be a lot of XML) but I do want to include enough to adequately test the method.
How do I test a method without just reimplementing the method?
How do I achieve adequate coverage to have confidence in the method I am testing without having to have test points for all possible inputs and results?
Obviously, don't dwell too much on my specific example as this applies to many situations where there are complex calculations and ranges of data to be tested.
NOTE: I am using Visual Studio and C#, but I believe this question is language-agnostic.

First off, you're right, you do not want your test code to reproduce the same calculation as the code under test. Secondly, your second approach is a step in the right direction. Your tests should contain a specific set of inputs with the pre-computed expected output values for those inputs.
Your XML file should contain just a subset of the input data that you've described. Your tests should ensure that you can handle the extreme ranges of your input domain (-360, 360), a few data points just inside the ends of the range, and a few data points in the middle. Your tests should also check that your code fails gracefully when given values outside the input range (e.g. -361 and +361).
Finally, in your specific case, you may want to have a few more edge cases to make sure that your function correctly handles "switchover" points within your valid input range. These would be the points in your input data where the output is expected to switch from "North" to "Northwest" and from "Northwest" to "West", etc. (don't run your code to find these points, compute them by hand).
Just concentrating on these edge cases and a few cases in between the edges should greatly reduce the amount of points you have to test.
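
A sketch of such a table-driven test in Python (the same idea carries over to C#). The `compass_point` function here is only a hypothetical stand-in so the example runs; it assumes the bearing is measured relative to north, with 45-degree sectors centred on each compass point. The expected values in `CASES` were worked out by hand from that assumption.

```python
import unittest

POINTS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def compass_point(north, bearing):
    """Stand-in for the method under test (hypothetical implementation)."""
    relative = (bearing - north) % 360
    return POINTS[int((relative + 22.5) // 45) % 8]

# (north, bearing, expected): computed by hand, not by running the code.
CASES = [
    (0, 0, "N"), (0, 22, "N"), (0, 23, "NE"),   # N -> NE switchover
    (0, 67, "NE"), (0, 68, "E"),                # NE -> E switchover
    (0, 337, "NW"), (0, 338, "N"),              # NW -> N wrap-around
    (-360, 360, "N"), (360, -360, "N"),         # extremes of the input range
    (90, 90, "N"), (0, -90, "W"),               # values in the middle
]

class CompassPointTest(unittest.TestCase):
    def test_hand_computed_cases(self):
        for north, bearing, expected in CASES:
            with self.subTest(north=north, bearing=bearing):
                self.assertEqual(compass_point(north, bearing), expected)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(CompassPointTest)
result = unittest.TextTestRunner().run(suite)
```

Out-of-range inputs such as -361 and +361 are left out here because the required behaviour for them is not specified in the question.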

You could possibly re-factor the method into parts that are easier to unit test and write the unit tests for the parts. Then the unit tests for the whole method only need to concentrate on integration issues.

I prefer to do the following.
Create a spreadsheet with right answers. However complex it needs to be is irrelevant. You just need some columns with the case and some columns with the expected results.
For your example, this can be big. But big is okay. You'll have an angle, a bearing and the resulting compass point value. You may have a bunch of intermediate results.
Create a small program that reads the spreadsheet and writes the simplified, bottom-line unittest cases. You want your cases stripped down to
    def testCase215n( self ):
        self.fixture.setCourse( 215 )
        self.fixture.setBearing( 45 )
        self.fixture.calculate()
        self.assertEqual( "N", self.fixture.compass() )
[That's Python, the same idea would hold for C#.]
The spreadsheet contains the one-and-only authoritative list of right answers. You generate code from this once or twice. Unless, of course, you find an error in your spreadsheet version and have to fix that.
I use a small Python program with xlrd and the Mako template generator to do this. You could do something similar with C# products.

If you can think of a completely different implementation of your method, with completely different places for bugs to hide, you could test against that. I often do things like this when I've got an efficient, but complex implementation of something that could be implemented much more simply but inefficiently. For example, if writing a hash table implementation, I might implement a linear search-based associative array to test it against, and then test using lots of randomly generated input. The linear search AA is very hard to screw up and even harder to screw up such that it's wrong in the same way as the hash table. Therefore, if the hash table has the same observable behavior as the linear search AA, I'd be pretty confident it's correct.
Other examples would include writing a bubble sort to test a heap sort against, or using a known working sort function to find medians and comparing that to the results of an O(N) median finding algorithm implementation.
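
As a small illustration of the bubble sort versus heap sort example, the sketch below compares the two on randomly generated lists. The heap sort is built on Python's `heapq` purely for brevity; the function names, the seed, and the number of trials are arbitrary choices for the example.

```python
import heapq
import random

def heap_sort(items):
    """Implementation under test (here built on heapq for brevity)."""
    heap = list(items)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

def bubble_sort(items):
    """Slow but simple reference implementation."""
    result = list(items)
    for end in range(len(result) - 1, 0, -1):
        for i in range(end):
            if result[i] > result[i + 1]:
                result[i], result[i + 1] = result[i + 1], result[i]
    return result

def count_mismatches(trials=200, seed=0):
    """Run both sorts on random lists and count the disagreements."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        data = [rng.randint(-100, 100) for _ in range(rng.randint(0, 30))]
        if heap_sort(data) != bubble_sort(data):
            mismatches += 1
    return mismatches

print(count_mismatches())
```

A fixed seed keeps any failure reproducible, which matters once a mismatch needs debugging.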

I believe that your solution is fine, despite using an XML file (I would have used a plain text file). But a more common tactic is to test just the limit situations, which in your case means entry values such as -360, 360, -361, 361 and 0.

You could try orthogonal array testing to achieve all-pairs coverage instead of all possible combinations. This is a statistical technique based on the theory that most bugs occur due to interactions between pairs of parameters. It can drastically reduce the number of test cases you write.

Not sure how complicated your code is; if it is taking an integer in and dividing it up into 8 or 16 directions on the compass, it is probably a few lines of code, yes?
You are going to have a hard time not re-writing your code to test it, depending how you test it. Ideally you want an independent party to write the test code based on the same requirements but without looking at or borrowing your code. This is unlikely to happen in most situations. In this case that may be overkill.
In this specific case I would feed it each number in order from -360 to +360, and print the number and the result (to a text file in a format that can be compiled into another program as a header file). Visually inspect that the direction changes at the desired input. This should be easy to visually inspect and validate. Now you have a table of inputs and valid outputs. Next have a program randomly select from the valid inputs feed it into your code under test and see that the right answer comes out. Do a few hundred of these random tests. At some point you need to validate that numbers less than -360 or greater than +360 are handled per your requirements, either clipping or modulating I assume.

So I took a software testing class, and basically what you want is to identify the classes of inputs: all real numbers? All integers? Only positive, only negative, etc. Then group the output actions: is 360 uniquely different from 359, or do they pretty much end up doing the same thing to the app? Once that is done, test a combination of inputs to outputs.
This all seems abstract and vague, but until you provide the method code it's difficult to come up with a perfect strategy.
Another way is to do branch-level testing or predicate coverage testing. Code coverage isn't foolproof, but not covering all your code seems irresponsible.

One approach, probably one to apply in combination with other methods of testing, is to see if you can make a function that reverses the method you are testing. In this case, it would take a compass direction (northeast, say) and output a bearing (given the bearing for north). Then you could test the method by applying it to a series of inputs, applying the function to reverse the method, and seeing if you get back the original input.
There are complications, particularly if one output corresponds to multiple inputs, but in those cases it may be possible to generate the set of inputs corresponding to a given output and test each member of the set (or a sample of its elements).
The advantage of this approach is that it doesn't rely on you being able to simulate the method manually, or create an alternative implementation of the method. If the reversal involves a different approach to the problem to that used in the original method, it should reduce the risk of making equivalent mistakes in both.
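
A sketch of the reversal idea for the compass example. Both functions are hypothetical: `compass_point` stands in for the method under test, and `bearings_for` is the reverse function, written from the sector definition (45-degree sectors centred on each point, bearing measured relative to north) instead of by calling the method.

```python
POINTS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def compass_point(north, bearing):
    """Stand-in for the method under test (hypothetical implementation)."""
    relative = (bearing - north) % 360
    return POINTS[int((relative + 22.5) // 45) % 8]

def bearings_for(point, north):
    """Reverse function: every whole-degree bearing in [0, 360) that should map to point."""
    centre = 45 * POINTS.index(point)
    offsets = range(-22, 23)  # whole degrees inside the 45-degree sector
    return sorted((north + centre + d) % 360 for d in offsets)

def round_trip_failures(north):
    """Map each generated bearing forward again and collect any that come back wrong."""
    failures = []
    for point in POINTS:
        for bearing in bearings_for(point, north):
            if compass_point(north, bearing) != point:
                failures.append((north, bearing, point))
    return failures

print(round_trip_failures(0), round_trip_failures(-135))
```

Because one compass point corresponds to many bearings, the reverse function returns the whole set and the check runs over every member, as described above.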

Pseudocode:
array colors = { red, orange, yellow, green, blue, brown, black, white }
for north = -360 to 361
    for bearing = -361 to 361
        theColor = colors[dirFunction(north, bearing)] // dirFunction is the one being tested
        setColor (theColor)
        drawLine (centerX, centerY,
                  centerX + (cos(north + bearing) * radius),
                  centerY + (sin(north + bearing) * radius))
    Verify Resulting Circle against rotated reference diagram.
When North = 0, you'll get an 8-colored pie chart. As north varies + or -, the pie chart will look the same, but rotated around that many degrees. Test verification is a simple matter of making sure that (a) the image is a properly rotated pie chart and (b) there aren't any green spots in the orange area, etc.
This technique, by the way, is a variation on the world's greatest debugging tool: Have the computer draw you a picture of what =IT= thinks it's doing. (Too often, developers waste hours chasing what they think the computer is doing, only to find that it's doing something completely different.)
