Boost.Accumulators percentile giving wrong values - C++

I am using boost::accumulators::tag::extended_p_square_quantile to calculate percentiles. This accumulator also needs the probabilities up front, so I construct it with m_acc = AccumulatorType(boost::accumulators::extended_p_square_probabilities = probs);, where probs is a vector containing the probabilities.
The values in the probs vector are {0.5, 0.3, 0.9, 0.7}.
I provided some sample values to the accumulator.
But when I try to get a percentile using boost::accumulators::quantile(m_acc, boost::accumulators::quantile_probability = probs[0]);, it returns incorrect values and sometimes even NaN.
What is wrong here?

I ran into this problem myself and wasted a lot of time figuring it out, so I want to answer it here.
The problem is with the vector: it must be sorted in increasing order of its values.
Change the vector values to {0.3, 0.5, 0.7, 0.9} and it will work as expected.
So if someone is using tag::extended_p_square_quantile for percentiles (which supports multiple probabilities), they need to supply the probabilities (vector/array/list) in sorted order.
This isn't the case with tag::p_square_quantile, because it takes only a single value (probability).

Related

If statement with multiplication

I am using this formula.
=IF(OR($U2>EOMONTH(AF$1,0),$V2<AF$1),"",$AE2*(MIN(EOMONTH(AF$1,0),$V2)-MAX(AF$1,$U2)+1)/IF(MOD(YEAR($AF$1),4)=0,366,365))
I need to make this formula multiply by whatever is in cell C3 if the value returned is greater than 0. I tried the PRODUCT function, but I am having trouble placing it into the formula. I keep getting the error
#VALUE!
You could try placing your whole formula in a MAX function:
=MAX(your_formula,0)*C3
If you are looking for a different solution, it would be very helpful to have some more information or a data example.

Ordering by sum of difference

I have a model that has one attribute with a list of floats:
values = ArrayField(models.FloatField(default=0), default=list, size=64, verbose_name=_('Values'))
Currently, I'm getting my entries and ordering them according to the sum of all diffs with another list:
def diff(l1, l2):
    return sum(abs(v1 - v2) for v1, v2 in zip(l1, l2))

list2 = [0.3, 0, 1, 0.5]
entries = list(Model.objects.all())  # materialize: QuerySets have no sort()
entries.sort(key=lambda t: diff(t.values, list2))
This works fast if my number of entries is small. But I'm afraid that with a large number of entries, comparing and sorting all of them will get slow, since they have to be loaded from the database. Is there a way to make this more efficient?
The best way is to write it yourself; right now you are iterating over the list over 4 times!
Although this approach looks pretty, it's not good.
One thing that you can do is (see the sketch below this list):
have a variable called last_diff and set it to 0
iterate through all entries
iterate through each entry.values
from i = 0 to the end of the list, calculate abs(entry.values[i] - list2[i])
sum these values up in a variable called new_diff
if new_diff > last_diff, break from the inner loop and push the entry into its right place (this is called insertion sort, check it out!)
This way, the average-case time complexity is much lower than what you are doing now!
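The early break only really pays off if you just need the best few matches rather than a full ordering, since a full sort needs every entry's exact diff. A rough Python sketch of that interpretation (best_matches and keep are hypothetical names, not from the question):

import bisect

def best_matches(entries, list2, keep=10):
    # keep the `keep` entries closest to list2 by L1 distance,
    # abandoning a running sum as soon as it exceeds the current cutoff
    best = []  # (diff, index, entry) tuples, kept sorted by diff
    for index, entry in enumerate(entries):
        cutoff = best[-1][0] if len(best) >= keep else float("inf")
        running = 0.0
        for v1, v2 in zip(entry.values, list2):
            running += abs(v1 - v2)
            if running > cutoff:  # cannot beat the worst kept entry
                break
        else:
            bisect.insort(best, (running, index, entry))  # insertion step
            del best[keep:]  # drop the now-worst entry
    return [entry for _, _, entry in best]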
You may also have to be creative. I'm going to share some ideas; check them for yourself to make sure they are sound.
Assuming that:
the values list elements are always positive floats, and
list2 is always the same for all entries,
then you may be able to say: the bigger the sum of the elements in values, the bigger the diff value is going to be, no matter what the elements of list2 are.
Then you might be able to forget about the whole diff function. (Test this!)
The only way to make this really go faster is to move as much work as possible to the database, i.e. the calculations and the sorting. It wasn't easy, but with the help of this answer I managed to write such a query in almost pure Django:
class Unnest(models.Func):
    function = 'UNNEST'

class Abs(models.Func):
    function = 'ABS'

class SubquerySum(models.Subquery):
    template = '(SELECT sum(%(field)s) FROM (%(subquery)s) _sum)'

x = [0.3, 0, 1, 0.5]
pairdiffs = Model.objects.filter(pk=models.OuterRef('pk')).annotate(
    pairdiff=Abs(Unnest('values') - Unnest(models.Value(x, ArrayField(models.FloatField())))),
).values('pairdiff')
entries = Model.objects.all().annotate(
    diff=SubquerySum(pairdiffs, field='pairdiff')
).order_by('diff')
The unnest function turns each element of the values array into a row of its own. In this case that happens twice, but the two resulting columns are immediately subtracted and made positive. Still, there are as many rows per pk as there are values, and these need to be summed, which is not as easy as it sounds: the column can't simply be aggregated. This was by far the trickiest part; even after fiddling with it for so long, I still don't quite understand why Postgres needs this indirection. Of the few options there are to make it work, I believe a subquery is the only one expressible in Django (and only as of 1.11).
Note that the above behaves exactly the same as zip, i.e. when one array is longer than the other, the remainder is ignored.
Further improvements
While it will already be a lot faster when you no longer have to retrieve all rows and loop over them in Python, this still results in a full table scan: all rows have to be processed, every single time. You can do better, though. Have a look into the cube extension. Use it to calculate the L1 distance (at least, that seems to be what you're calculating) directly with the <#> operator; a rough sketch follows. That will require the use of RawSQL or a custom Expression. Then add a GiST index on the SQL expression cube("values"), or directly on the field if you're able to change its type from float[] to cube. In the latter case you might have to implement your own CubeField too; I haven't found any package yet that provides it. In any case, with all that in place, top-N queries on the lowest distance will be fully indexed and hence blazing fast.
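A rough sketch of the RawSQL variant, assuming the cube extension is installed (CREATE EXTENSION cube) and that values fits within cube's dimension limit; the exact cast and parameter handling may need adjusting for your setup:

from django.db.models.expressions import RawSQL

x = [0.3, 0, 1, 0.5]
# cube("values") <#> cube(x) is the L1 (taxicab) distance in Postgres
entries = Model.objects.annotate(
    diff=RawSQL('cube("values") <#> cube(%s::float8[])', (x,))
).order_by('diff')[:10]  # a top-N query that a GiST index on cube("values") can serve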

Python, getting a centered average with a catch

So, my assignment is to get the centered average of a list, much like a few other posts on here, such as this one (https://codereview.stackexchange.com/questions/108404/how-to-get-a-centered-mean-excluding-max-and-min-value-of-a-list-in-python). However, my professor has told us we are not allowed to use min, max, or sort to solve this. What I have right now is this (it is still a work in progress):
def centered_average(nums):
    high=0
    low=0
    a=0
    b=0
    for i in range(len(nums)):
        if nums[i]>a:
            a=nums[i]
            high=a
    for i in range(len(nums)):
        if nums[i]<b:
            b=nums[i]
            low=b
    total=sum(nums)
    average=(total-high-low)/(len(nums)-2)
    print(average)
My problem is that I can't get low to be recognized as the lowest number in the list. For example, if I input [1,2,3,4,5], my function should find 5 as the high and 1 as the low, and return 3 as the centered average, since 2+3+4 is 9 and 9/3 = 3. However, what I have there returns the low as 0. I think it is because of the len(nums), since it would think the first number is a 0. I'm not sure how I should fix this.
Note: I am still a beginner at this, so I know what I have might not be the best and the error might be simple to fix, but I am still in the process of learning, so any help and advice would be much appreciated.
The problem is that you are starting the running minimum (and running maximum) at 0.
Start the running minimum at float("inf") (everything is guaranteed to be less than that) and the running maximum at float("-inf") (everything is guaranteed to be greater than that).
Or start both at the first element of the list, which is either a true minimum/maximum, or there is another element that is lower/higher than it.
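For example, a minimal corrected version seeding both running values with the first element (it assumes the list has at least three elements, as in the original):

def centered_average(nums):
    high = nums[0]  # running maximum
    low = nums[0]   # running minimum
    for n in nums:
        if n > high:
            high = n
        if n < low:
            low = n
    return (sum(nums) - high - low) / (len(nums) - 2)

print(centered_average([1, 2, 3, 4, 5]))  # 3.0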

Get an interval of possible values in regression using scikit-learn machine learning

I am trying to use regression to predict a value. For a given set of independent variables I get a fixed number as the predicted value. However, is it possible to get a range of values instead, so as to say that the maximum possible value is, say, x and the minimum possible value is, say, y?
Using
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
pred = regr.predict([[a, b]])
The value of pred comes out to be, say, 10, but I would rather want something like max = 12 and min = 8.
Simply put, a range of values.
UPDATE
I tried looking into GMM, but I am not sure it works for this.
I also tried Gaussian processes, but they again give a single value, something like 11.137631, which doesn't really help, as I am looking for a range of values rather than a single value.
Linear regression always gives the same result for a given input vector, but running a random forest regressor gives a different result on each iteration, and that spread can be used to get a minimum and maximum possible value for a given input vector's forecast (see the sketch below).
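One way to get such a range without re-fitting in a loop is to look at the spread of the individual trees' predictions; a sketch, reusing X_train, Y_train, a and b from the question:

from sklearn.ensemble import RandomForestRegressor

regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr.fit(X_train, Y_train)

# every tree in the forest predicts on its own; the spread of those
# predictions gives a crude interval around the ensemble's mean
tree_preds = [tree.predict([[a, b]])[0] for tree in regr.estimators_]
print(min(tree_preds), regr.predict([[a, b]])[0], max(tree_preds))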

Recursively find subsets

Here is a recursive function that I'm trying to create that finds all the subsets of an STL set. The two params are an STL set to search for subsets, and a number i >= 0 which specifies how big the subsets should be. If the integer is bigger than the size of the set, it should return no subsets.
I don't think I'm doing this correctly. Sometimes it's right, sometimes it's not. The STL set gets passed in fine.
list<set<int> > findSub(set<int>& inset, int i)
{
    list<set<int> > the_list;
    list<set<int> >::iterator el = the_list.begin();
    if(inset.size()>i)
    {
        set<int> tmp_set;
        for(int j(0); j<=i; j++)
        {
            set<int>::iterator first = inset.begin();
            tmp_set.insert(*(first));
            the_list.push_back(tmp_set);
            inset.erase(first);
        }
        the_list.splice(el, findSub(inset, i));
    }
    return the_list;
}
From what I understand, you are actually trying to generate all subsets of i elements from a given set, right?
Modifying the input set is going to get you into trouble; you'd be better off not modifying it.
I think the idea is simple enough, though I would say that you got it backwards. Since it looks like homework, I won't give you a C++ algorithm ;)
generate_subsets(set, sizeOfSubsets)   # I assume sizeOfSubsets cannot be negative
                                       # use a type that enforces this for god's sake!
    if sizeOfSubsets is 0 then return {}
    else if sizeOfSubsets is 1 then
        result = []
        for each element in set do result <- result + {element}
        return result
    else
        result = []
        baseSubsets = generate_subsets(set, sizeOfSubsets - 1)
        for each subset in baseSubsets
            for each element in set
                if element not in subset then result <- result + { subset + element }
        return result
The key points are:
generate the subsets of lower rank first, as you'll have to iterate over them
don't try to insert an element into a subset that already contains it; that would give you a subset of incorrect size
Now you'll have to understand this and transpose it to 'real' code.
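For illustration only (keeping with the no-C++ spirit above), here is a Python transcription of the pseudocode. Note that it needs one extra duplicate check the pseudocode glosses over, since {1} + 2 and {2} + 1 both produce {1, 2}:

def generate_subsets(s, size):
    # all subsets of s with exactly `size` elements
    if size <= 0 or size > len(s):
        return []
    if size == 1:
        return [{element} for element in s]
    result = []
    for subset in generate_subsets(s, size - 1):  # lower rank first
        for element in s:
            if element not in subset:
                candidate = subset | {element}
                if candidate not in result:  # avoid duplicates
                    result.append(candidate)
    return result

print(generate_subsets({1, 2, 3}, 2))  # [{1, 2}, {1, 3}, {2, 3}]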
I have been staring at this for several minutes and I can't figure out your train of thought for thinking this would work. You are permanently removing several members of the input set before exploring every possible subset they could participate in.
Try working out your intended solution in pseudo-code and see if you can spot the problem without the STL interfering.
It seems (I'm not a native English speaker) that what you could do is compute the power set (the set of all subsets) and then select only the subsets matching your condition from it.
You can find methods for calculating the power set on the Wikipedia 'Power set' page and on Math Is Fun (the link, named 'Power Set from Math Is Fun', is in the External links section of that Wikipedia page; I cannot post it here directly because of the spam prevention mechanism). On Math Is Fun, see mainly the section 'It's binary'. A small illustration follows.
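A small Python illustration of that approach, using the standard itertools power-set recipe and then filtering by size:

from itertools import chain, combinations

def power_set(s):
    items = list(s)
    # subsets of every size, from the empty set up to the full set
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

# select only the subsets matching the size condition
print([set(c) for c in power_set({1, 2, 3, 4}) if len(c) == 2])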
I also can't see what this is supposed to achieve.
If this isn't homework with specific restrictions, I'd simply suggest testing against a temporary std::set with std::includes().