I have a small sample of data and would like to find the value for which there is a 90% chance that the actual value falls short of it and a 10% chance that the actual value exceeds it.
I understand how to do this by hand, but I know there is a SAS function that can do this with certain calculated inputs.
Can anybody help?
Thanks!
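What is being described here is the 90th percentile of the sample. Below is a minimal sketch of the by-hand calculation (in C++ purely for illustration, not SAS), using the nearest-rank definition; statistical packages typically offer several percentile definitions, so results can differ slightly.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Nearest-rank percentile: sort the sample and take the value at rank ceil(p * n).
double percentile(std::vector<double> x, double p) {  // p in (0, 1], e.g. 0.9
    std::sort(x.begin(), x.end());
    std::size_t rank = static_cast<std::size_t>(std::ceil(p * x.size()));  // 1-based rank
    return x[rank - 1];
}

int main() {
    std::vector<double> sample = {12.1, 9.8, 15.3, 11.0, 10.4, 14.7, 13.2, 9.5, 16.8, 12.9};
    std::cout << "90th percentile: " << percentile(sample, 0.9) << '\n';  // prints 15.3
}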
I am trying to apply the Apriori and FP-Growth algorithms to some characterisation data that I have. The data are already binarised and are composed of 1s (passes), 0s (fails), and Null values.
I want to check whether my preprocessing pipeline would be good enough in practice. I have already removed rows/columns from the dataset where the ENTIRE row/column is Null, and I am still left with some Null values.
I was thinking of applying categorical PCA to shrink the dataset even further, but I believe that wouldn't be good practice, as it requires imputing the missing values with something else, and I don't want that because it would affect the final results.
So what I am actually doing to address the Null values is to fill them with 0. I do this because the algorithms above measure the frequency of items that exist in a database, and as I understand it, the 1s are the data points that contribute to that frequency; hence the rest should be 0.
But I am still not sure whether this is good practice, because it looks like I am filling the Null values with a 0 (failure) as if it had been measured.
Any help on whether I am tackling my problem correctly, or whether I should try something else, would be very much appreciated. :)
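For reference, here is a minimal sketch (in C++, with a made-up matrix, column names, and support threshold) of how item support is counted over a binarised table once the Nulls have been filled with 0, along the lines described above. It illustrates the counting logic only, not any particular Apriori library.

#include <iostream>
#include <string>
#include <vector>

int main() {
    // Rows = transactions/devices, columns = tests; 1 = pass, 0 = fail or
    // (after the fill step) "not measured".
    std::vector<std::vector<int>> data = {
        {1, 0, 1},
        {1, 1, 0},   // the 0 here could originally have been a Null
        {1, 0, 1},
    };
    std::vector<std::string> cols = {"testA", "testB", "testC"};

    const double minSupport = 0.5;  // hypothetical threshold
    for (std::size_t c = 0; c < cols.size(); ++c) {
        int count = 0;
        for (const auto& row : data) count += row[c];  // only 1s add to the count
        double support = static_cast<double>(count) / data.size();
        std::cout << cols[c] << " support = " << support
                  << (support >= minSupport ? "  (frequent)" : "") << '\n';
    }
    // Note: a filled-in 0 never increases an item's count, but the row still
    // counts in the denominator, which is what makes the choice debatable.
}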
I'm working on a problem where I have an entire table from a database in memory at all times; each row has a low bound and a high bound, both 9-digit numbers. I'm given a 9-digit number that I need to use to look up the rest of the columns in the table based on which range that number falls in. For example, if the range was 100,000,000 to 125,000,000 and I was given the number 117,123,456, then I would know that I'm in the 100-125 mil range, and whatever vector of data that range points to is what I will be using.
Now the best lookup time I can think of is log(n). This is OK, at best, but still pretty slow. The table has at least 100,000 entries and I will need to look up values in this table tens of thousands, if not hundreds of thousands, of times per execution of this application (which runs 10+ times/day).
So I was wondering if it was possible to use an unordered_set instead, writing my own hash function that ALWAYS returns the same hash value for every number in a given range. Using the same example above, 100,000,000 through 125,000,000 would always return, for example, a hash value of AB12CD. Then when I use the lookup value of 117,123,456, I will get that same AB12CD hash and have a lookup time of O(1).
Is this possible, and if so, any ideas how?
Thanks in advance.
Yes. Assuming that you can number your intervals in order, you could fit a polynomial to your cutoff values, and receive an index value from the polynomial. For instance, with cutoffs of 100,000,000, 125,000,000, 250,000,000, and 327,000,000, you could use points (100, 0), (125, 1), (250, 2), and (327, 3), restricting the first derivative to [0, 1]. Assuming that you have decently-behaved intervals, you'll be able to fit this with an (N+2)th-degree polynomial for N cutoffs.
Have a table of desired hash values; use floor[polynomial(i)] for the index into the table.
Can you write such a hash function? Yes. Will evaluating it be slower than a search? Well there's the catch...
I would personally solve this problem as follows. I'd have a sorted vector of all values. And then I'd have a jump table of indexes into that vector based on the value of n >> 8.
So now your logic is that you look in the jump table to figure out where you are jumping to and how many values you should consider. (Just look at where you land versus the next index to see the size of the range.) If the whole bucket maps to the same vector of data, you're done. If there are only a few entries, do a linear search to find where you belong. If there are a lot of entries, do a binary search. Experiment with your data to find when a binary search beats a linear search.
A vague memory suggests that the tradeoff is around 100 or so because predicting a branch wrong is expensive. But that is a vague memory from many years ago, so run the experiment for yourself.
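A sketch of that layout, assuming the ranges are sorted by their lower bound, do not overlap, and fit in 32-bit integers; RangeRow, RangeIndex, and kShift are illustrative names, not anything from the question.

#include <cstddef>
#include <cstdint>
#include <vector>

struct RangeRow {
    uint32_t low;   // inclusive lower bound, e.g. 100000000
    uint32_t high;  // inclusive upper bound, e.g. 125000000
    int      data;  // stand-in for "the rest of the columns"
};

class RangeIndex {
public:
    static constexpr int kShift = 8;  // bucket = value >> 8; widen to trade memory for speed

    explicit RangeIndex(std::vector<RangeRow> sortedRows) : rows_(std::move(sortedRows)) {
        const uint32_t maxKey = rows_.back().high;
        jump_.resize((maxKey >> kShift) + 2);
        // jump_[b] = index of the first row whose lower bound falls in bucket b or later.
        std::size_t r = 0;
        for (std::size_t b = 0; b < jump_.size(); ++b) {
            while (r < rows_.size() && (rows_[r].low >> kShift) < b) ++r;
            jump_[b] = static_cast<uint32_t>(r);
        }
    }

    // Returns the row whose [low, high] contains key, or nullptr if none does.
    const RangeRow* lookup(uint32_t key) const {
        const std::size_t b = key >> kShift;
        if (b + 1 >= jump_.size()) return nullptr;   // beyond the largest range
        // Candidates: rows starting in this bucket, plus the row just before
        // them (its range may reach into this bucket).
        std::size_t lo = jump_[b] > 0 ? jump_[b] - 1 : 0;
        std::size_t hi = jump_[b + 1];
        // Buckets hold few candidates, so a linear scan is usually fine;
        // switch to a binary search if your data makes the buckets large.
        for (std::size_t i = lo; i < hi; ++i)
            if (rows_[i].low <= key && key <= rows_[i].high) return &rows_[i];
        return nullptr;
    }

private:
    std::vector<RangeRow> rows_;   // sorted by low
    std::vector<uint32_t> jump_;
};

With 9-digit keys and a shift of 8, the jump table has roughly four million uint32_t entries (about 16 MB); a larger shift shrinks the table at the cost of slightly longer scans per bucket.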
I am trying to use the kmeans function in OpenCV to pre-classify 36,000 sample images into 100+ classes (to reduce my work in preparing training data for supervised learning). In this function there are two parameters which I do not really understand: cv::TermCriteria::EPS and cv::TermCriteria::COUNT.
// TermCriteria(type, maxCount, epsilon): with EPS + COUNT, iteration stops after
// maxCount iterations or once the change falls below epsilon, whichever comes first.
cv::kmeans(dataset.t(), K, kmean_labels,
           cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 10, 1.0),
           3 /* attempts */, cv::KMEANS_PP_CENTERS, kmean_centers);
The OpenCV documentation explains that:
cv::TermCriteria::EPS: the desired accuracy or change in parameters at which the iterative algorithm stops.
cv::TermCriteria::COUNT: the maximum number of iterations or elements to compute.
The explanation above is not quite clear to me. Can anyone explain further and show how to find good values for COUNT and EPS?
Thank you very much.
There are no magical numbers that will fit all applications (otherwise they wouldn't be parameters).
K-means is an iterative algorithm: it moves towards an optimum and each iteration should improve on the last, but you need to tell the algorithm when to stop.
Using cv::TermCriteria::COUNT, you tell the algorithm: perform at most x iterations, then stop. But this doesn't guarantee any precision.
Using cv::TermCriteria::EPS, you tell the algorithm to continue iterating until the difference between two successive iterations becomes sufficiently small. The parameter EPS tells the algorithm how small this difference should become. This of course depends on the dataset that you are feeding to the algorithm. Suppose you multiply all your data points by 10; then EPS should vary accordingly (quadratically I suppose, but I'm not sure about that).
When you use both parameters, you tell the algorithm to stop as soon as either condition is fulfilled; for example: stop iterating when the difference between two successive runs is smaller than 0.1, OR when you have done 10 iterations.
In conclusion: only analysis of your dataset, and trial and error, can give you decent values...
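A conceptual sketch of how the two criteria combine into a single stopping rule. This is not OpenCV's internal code; runOneIteration() and cluster() are hypothetical stand-ins, where runOneIteration() performs one k-means update and returns how far the cluster centers moved.

#include <iostream>

double runOneIteration() {
    // Dummy stand-in: pretend the center shift halves on every call.
    static double shift = 8.0;
    return shift /= 2.0;
}

void cluster(int maxCount, double eps) {
    for (int iter = 1; iter <= maxCount; ++iter) {   // cv::TermCriteria::COUNT part
        double shift = runOneIteration();
        if (shift < eps) {                           // cv::TermCriteria::EPS part
            std::cout << "converged after " << iter << " iterations (shift " << shift << ")\n";
            return;
        }
    }
    std::cout << "stopped by the iteration limit (" << maxCount << ")\n";
}

int main() {
    cluster(/*maxCount=*/10, /*eps=*/1.0);  // mirrors TermCriteria(EPS + COUNT, 10, 1.0)
}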
I just saw that function in code, and intuitively it should return the next prime number greater than the argument. When I call it that way, however, I get 53, and then when I pass in 54 I get 97. I'm not finding a description of what it does online; can anybody point me to one, or does anybody know what this does?
It returns the next prime that is sufficiently greater than its argument to be worth reorganizing a hash table to that number of buckets. If it returned the very next prime, you'd be reorganizing your hash tables way too often. It is an implementation detail of the hash table code and is not meant to be used by outside code.
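A hypothetical sketch of that idea, not the actual implementation the question refers to: keep a short table of primes that roughly double and return the first entry greater than the argument. With the illustrative table below, inputs from 23 up to 52 all map to 53, and 54 jumps to 97.

#include <algorithm>
#include <iostream>
#include <iterator>

unsigned long nextBucketCount(unsigned long n) {
    // Illustrative values only; real implementations use longer, tuned tables.
    static const unsigned long primes[] = {5, 11, 23, 53, 97, 193, 389, 769, 1543, 3079};
    const unsigned long* p = std::upper_bound(std::begin(primes), std::end(primes), n);
    return p != std::end(primes) ? *p : *(std::end(primes) - 1);
}

int main() {
    std::cout << nextBucketCount(52) << '\n';  // 53
    std::cout << nextBucketCount(54) << '\n';  // 97
}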
I have a quad tree where the leaf nodes represent pixels. There is a method that prunes the quad tree, and another method that calculates the number of leaves that would remain if the tree were pruned. The prune method accepts an integer tolerance which is used as a limit on the difference between nodes when deciding whether to prune or not. Anyway, I want to write a function that takes one argument, leavesLeft. It should calculate the minimum tolerance necessary to ensure that, upon pruning, no more than leavesLeft leaves remain in the tree. The hint is to use binary search recursively to do this.
My question is that I am unable to make the connection between binary search and the function I need to write, and I am not sure how it would be implemented. I know that the maximum tolerance allowable is 256*256*3 = 196,608, but apart from that I don't know how to get started. Can anyone guide me in the right direction?
You want to look for Nick's spatial index quadtree and Hilbert curve.
1. Write a method that just tries all possible tolerance values and checks whether that would leave exactly enough nodes.
2. Write a test case and see if it works.
3. Don't do step 1. Use a binary search over all possible tolerance values to do the same as step 1, but quicker.
If you don't know how to implement a binary search, it is best to try it on a simple integer array first. Anyway, if you do step 1, just store the number of leaves left in an array (with the tolerance as the index), and then execute a binary search on that. To turn this into step 3, notice that you don't need the entire array: simply replace the array with a function that calculates the values and you're done.
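A minimal sketch of step 1, assuming a hypothetical countLeaves(tolerance) that wraps your tree's "how many leaves would remain if pruned at this tolerance" method; the dummy body below just fakes a non-increasing count so the example compiles and runs.

#include <iostream>

const int kMaxTolerance = 256 * 256 * 3;  // 196,608, as in the question

// Hypothetical stand-in: leaves remaining never increases as the tolerance grows.
int countLeaves(int tolerance) {
    return 40000 / (1 + tolerance / 100);  // fake numbers, for illustration only
}

// Smallest tolerance that leaves at most leavesLeft leaves, by brute force.
int minToleranceBruteForce(int leavesLeft) {
    for (int t = 0; t <= kMaxTolerance; ++t)
        if (countLeaves(t) <= leavesLeft) return t;
    return kMaxTolerance;
}

int main() {
    std::cout << minToleranceBruteForce(1000) << '\n';  // prints 3900 with the dummy countLeaves
}

This is the slow step-1 baseline you can test against before switching to the binary search.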
Say you plugged in tolerance = 0. Then you'd get an extreme answer like zero leaves left or all the leaves left (not sure how it works from your question). Say you plug in tolerance = 196,608. You'd get an answer at the other extreme. You know the answer you're looking for is somewhere in between.
So you plug in a tolerance number halfway between 0 and 196,608: a tolerance of 98,304. If the number of leaves left is too high, then you know the correct tolerance is somewhere between 0 and 98,304; if it's too low, the correct tolerance is somewhere between 98,304 and 196,608. (Or maybe the high/low parts are reversed; I'm not sure from your question.)
That's binary search. You keep cutting the range of possible values in half by checking the one in the middle. Eventually you narrow it down to the correct tolerance. Of course you'll need to look up binary search in order to implement it correctly.
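A sketch of that search written recursively (as the hint in the question suggests), under the same assumption as above: a hypothetical countLeaves(tolerance) that never increases as the tolerance grows. Replace the dummy body with a call into your quad tree.

#include <iostream>

const int kMaxTolerance = 256 * 256 * 3;  // 196,608

int countLeaves(int tolerance) {
    return 40000 / (1 + tolerance / 100);  // dummy stand-in, for illustration only
}

// Smallest tolerance in [lo, hi] such that countLeaves(tolerance) <= leavesLeft.
int minTolerance(int leavesLeft, int lo, int hi) {
    if (lo >= hi) return lo;
    int mid = lo + (hi - lo) / 2;              // probe the middle of the range
    if (countLeaves(mid) <= leavesLeft)
        return minTolerance(leavesLeft, lo, mid);    // mid works; try to go lower
    return minTolerance(leavesLeft, mid + 1, hi);    // too many leaves; go higher
}

int main() {
    std::cout << minTolerance(1000, 0, kMaxTolerance) << '\n';  // 3900 with the dummy countLeaves
}

Each recursive call halves the remaining range of [0, 196,608], so it needs only about 18 calls to countLeaves instead of potentially all 196,609.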