Garbage values in output - C++

I have the following program in which I am trying to set wrong positions, and I am getting garbage values in Average, Min & Max. Can we get NOVALUE as the output if the array contains only invalid values?

No, you can't. Fix your program so that there are no invalid values.

Your question is not well formulated, but I think I know what you mean.
Best is: do not add invalid values to your statistic in the first place; they would corrupt it.
Next: record how many values you have added to your statistic. If it is 0, the statistic is empty. This is especially important when you initialize your min and max with extreme or special values, which only get overwritten once the first real value appears.

You can add a check for the case when you have "wrong positions" and then either set a flag or store a distinct sentinel value in Average, Min and Max.
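Both suggestions boil down to the same pattern: count how many valid values went in, and report "no value" when that count is zero. Below is a minimal sketch of that pattern, written in Python for brevity (the question itself is C++, and all names here are illustrative, not taken from the asker's program):

class RunningStats:
    def __init__(self):
        self.count = 0
        self.total = 0
        self.low = None
        self.high = None

    def add(self, value, valid=True):
        # Do not let invalid values into the statistic at all.
        if not valid:
            return
        self.count += 1
        self.total += value
        if self.low is None or value < self.low:
            self.low = value
        if self.high is None or value > self.high:
            self.high = value

    def summary(self):
        # count == 0 means the statistic is empty: there is no average,
        # min or max to report, so return a distinct "no value" marker.
        if self.count == 0:
            return None
        return (self.total / self.count, self.low, self.high)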

Related

Weka discretising attribute where one value is most common by far

I have a dataset in which there's a numerical attribute for the 'number of days since last contact', but the value -1 is being used to indicate that there hasn't been a last contact. It is by far the most common value for this attribute.
My idea is to discretise this attribute but how can I ensure there is a 'no contact'/-1 bin?
Also, is this the correct approach to this problem?
The proper approach presumably is to:
Split the data into -1 and everything else
Apply binning to the values in the 'everything else' set only
Concatenate the data sets again (it may be good to shuffle, too)
If anyone else has this question and can't find an answer, here's how I did it based on Anony-Mousse's method. The filter documentation for MathExpression gives a good example of splitting into arbitrary bins.
Split using the MathExpression filter, e.g. ifelse(A>0, 2, 1) to split into two bins: above and below 0. I used ifelse(A>0, ifelse(A>400, 21, ceil(A/20)+1), 1) to bin my -1 and >400 values, with the in-between values going into bins of width 20.
Convert using numericToNominal

Python, getting a centered average with a catch

So, my assignment is to get the centered average of a list, much like a few of the other posts on here, like this one (https://codereview.stackexchange.com/questions/108404/how-to-get-a-centered-mean-excluding-max-and-min-value-of-a-list-in-python) and a few others. However, my professor has told us we are not allowed to use min, max, or sort to solve this. What I have right now is this; it is still a work in progress:
def centered_average(nums):
    high = 0
    low = 0
    a = 0
    b = 0
    for i in range(len(nums)):
        if nums[i] > a:
            a = nums[i]
            high = a
    for i in range(len(nums)):
        if nums[i] < b:
            b = nums[i]
            low = b
    total = sum(nums)
    average = (total - high - low) / (len(nums) - 2)
    print(average)
My problem is that I can't get low to be recognized as the lowest number in the list. For example, if I input [1, 2, 3, 4, 5] as the list, my function should return 5 as the high, 1 as the low, and 3 as the centered average, since 2 + 3 + 4 = 9 and 9 / 3 = 3. However, what I have right there returns the low as 0. I think it is because of the len(nums), since it would think the first number is a 0. I'm not sure how I should fix this.
Note: I am still a beginner at this stuff so I know what I have might not be the best or that the error could be simple to fix, but I am still in the process of learning so any help and advice would be much appreciated.
The problem is that you're starting the running minimum (and running maximum) at 0.
Start the running minimum as float("inf") (as everything is guaranteed to be less than that). Start the running maximum as float("-inf") (as everything is guaranteed to be greater than that).
Or, start both as the first element of the list (which is either a true minimum/maximum, or there's another element that is lower/higher than it).
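Putting the second suggestion into the original function, a possible corrected sketch (assuming the list has at least three elements) looks like this:

def centered_average(nums):
    # Seed both running extremes from the first element instead of 0,
    # so lists of all-positive or all-negative numbers work too.
    high = nums[0]
    low = nums[0]
    for n in nums:
        if n > high:
            high = n
        if n < low:
            low = n
    # Drop one copy each of the highest and lowest value from the sum.
    return (sum(nums) - high - low) / (len(nums) - 2)

print(centered_average([1, 2, 3, 4, 5]))  # prints 3.0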

Get an interval of values possible in regression using sci-kit learn machine learning

I am trying to use regression to predict a value. For a given set of independent variables, I get a fixed number as the predicted value. However, is it possible to get a range of values, so as to say that the maximum possible value is, say, x and the minimum possible value is, say, y?
Using
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
pred = regr.predict([[a, b]])
The value of pred comes out to be, say, 10, but I would rather want something like max = 12 and min = 8.
Simply put, a range of values.
UPDATE
Tried looking into GMM, not sure if that works for this.
Tried Gaussian processes, but that again gives a single value, something like 11.137631, which really doesn't help, as I am looking for a range of values rather than a single value.
Linear regression always gives the same result for a given input vector. A random forest regressor, however, is built from many randomized trees, so repeated fits (or the individual trees themselves) give different results, and that spread can be used to get a minimum and maximum possible value for a given input vector.
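One way to turn that spread into an explicit range is to look at the predictions of the individual trees inside a fitted RandomForestRegressor. This is only a sketch: X_train, Y_train, a and b are the names from the question, and the forest settings below are arbitrary assumptions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr.fit(X_train, Y_train)

# Every tree in the ensemble produces its own prediction for the input.
tree_preds = np.array([tree.predict([[a, b]])[0] for tree in regr.estimators_])

pred = tree_preds.mean()                  # close to regr.predict([[a, b]])
low, high = tree_preds.min(), tree_preds.max()
print(pred, low, high)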

Boost::accumulator's percentile giving wrong values

I am using boost::accumulators::tag::extended_p_square_quantile for calculating percentiles. For this, I also need to feed probabilities to the accumulator, so I did this: m_acc = AccumulatorType(boost::accumulators::extended_p_square_probabilities = probs); where probs is a vector containing the probabilities.
Values in the probs vector are {0.5, 0.3, 0.9, 0.7}.
I provided some sample values to the accumulator.
But when I try to get the percentile using boost::accumulators::quantile(m_acc, boost::accumulators::quantile_probability = probs[0]); it returns incorrect values and sometimes even nan.
What is wrong here?
I ran into this problem and wasted a lot of time figuring it out, so I want to answer it here.
The problem is with the vector: it should be sorted in increasing order of its values.
Change the vector values to {0.3, 0.5, 0.7, 0.9} and it will work as expected.
So if someone is using tag::extended_p_square_quantile for percentiles (which supports multiple probabilities), then they need to give the probabilities (vector/array/list) in sorted order.
This isn't the case with tag::p_square_quantile, because it takes only one value (probability).

rrdtool info ds unknown_sec meaning

The answer to this question is probably obvious, but I couldn't figure out what it means nor find the documentation for it. When I run rrdtool info, I get ds[source].unknown_sec = 0, and I am not sure what exactly it means... Any help or pointers would be appreciated!
The unknown_sec is the number of seconds for which the value of the DS is Unknown. This could be because the supplied value was Unknown, or was outside the specified range, or because the time since the last sample exceeds the heartbeat time (which marks everything since then as Unknown).
The amount of Unknown time in an interval is then used in combination with the xff fraction in the RRA definitions to determine if a consolidated data point stored in the RRA is Unknown.
Actually, I think I figured out what it means. If the gap between two data samples exceeds the heartbeat value, then the entire interval between those two samples is marked as unknown. So unknown_sec = 0 means the gap between consecutive samples has never exceeded the heartbeat value.