Why are these algorithms running faster than they should be? (C++)

I created a C++ program that outputs the input size vs. execution time (in microseconds) of several algorithms and writes the results to a .csv file. Upon importing the .csv into LibreOffice Calc and plotting graphs,
I noticed that binary search for input sizes up to 10000 appears to run in constant time, even though I search for an element that is not in the array. Similarly, up to the same input size, merge sort seems to run in linear time instead of the linear-logarithmic time it has in all cases.
Insertion sort and bubble sort run just fine, and their output plots resemble their worst-case quadratic complexity very closely.
I provide the input arrays from a file. For n = 5, the contents of the file are as follows. Each line represents an input array:
5 4 3 2 1
4 3 2 1
3 2 1
2 1
1
The results.csv file on running insertion sort is:
Input,Time(ms)
5,4
4,3
3,2
2,2
1,2
The graph of binary search for maximum input 100 is here.
Also, the graph of merge sort for maximum input 1000 is here; it looks a lot like it is linear (the values in the table also suggest so).
Any help as to why this is happening will be greatly appreciated.
Here is a link to the github repository for the source code: https://github.com/dhanraj-s/Time-Complexity

Complexity is about asymptotic worst case behaviour.
...worst case...
Even a quadratic algorithm may fall back to a linear variant if the input allows. Its complexity is still quadratic because, for the worst case, the algorithm can only guarantee a quadratic runtime.
...asymptotic...
It might well be that the asymptotic behaviour for the algorithms starts to settle in only for input sizes much bigger than what you chose.
That being said, in practice complexity alone is not the most useful metric; if you do care about performance, you need to measure.
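As a rough illustration of that last point (a minimal sketch of my own, not the poster's code): timer resolution and per-call overhead are easily a few microseconds, which swamps the roughly 14 comparisons a binary search over 10,000 elements needs, so a single timed call looks "constant". Repeating the call many times and dividing gives steadier numbers:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    for (int n : {1000, 10000, 100000, 1000000}) {
        std::vector<int> v(n);
        std::iota(v.begin(), v.end(), 0);           // sorted input 0..n-1
        const int target = -1;                      // element that is never found (worst case)
        const int reps = 100000;                    // repeat to rise above timer noise
        bool found = false;
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; ++r)
            found |= std::binary_search(v.begin(), v.end(), target);
        auto t1 = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        // print 'found' so the compiler cannot optimize the searches away
        std::cout << n << ": " << ns / double(reps) << " ns per search " << found << '\n';
    }
}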

Related

What does time complexity actually mean?

I got the task of showing the time taken by the merge sort algorithm, theoretically (n log(n)) and practically (by program), on a graph using different values of n and the time taken.
In the program, I'm printing the time difference between just before calling the function and just after it returns, in microseconds. I want to know what n log(n) actually means.
I tried with these values:
Number of values:
10000 20000 30000 40000 50000 60000 70000 80000 90000 100000
program time in microseconds:
12964 24961 35905 47870 88764 67848 81782 97739 111702 119682
time using the n log(n) formula:
132877 285754 446180 611508 780482 952360 1.12665e+006 1.30302e+006 1.48119e+006 1.66096e+006
code:
auto start = std::chrono::high_resolution_clock::now();
mergeSort(arr, 0, n - 1);
auto elapsed = std::chrono::high_resolution_clock::now() - start;
long long microseconds = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
cout << microseconds << " ";
Graph I got:
What does time complexity actually mean?
I interpret your question in the following way:
Why is the actual time needed by the program not K*n*log(n) microseconds?
The answer is: because on modern computers, the same step (such as comparing two numbers) does not take the same amount of time every time it is executed.
If you look at the times for 50,000 and 60,000 numbers, you can see that sorting the 50,000 numbers actually took more time than sorting the 60,000 numbers.
The reason might be some interrupt that occurred while the 50,000 numbers were being sorted; I assume you'll get a time between those for 40,000 and 60,000 numbers if you run your program a second time.
In other words: External influences (like interrupts) have more impact on the time needed by your program than the program itself.
I got the task of showing the time taken by the merge sort algorithm theoretically (n log(n)) and practically (by program) on a graph by using different values of n and time taken.
I'd take a number of elements to be sorted that takes about one second. Let's say sorting 3 million numbers takes one second; then I would sort 3, 6, 9, 12, ... and 30 million numbers and measure the time.
This reduces the influence of interrupts etc. on the measurement. However, you'll still see some effect from the memory cache in this case.
You can use your existing measurements (especially the 50,000 and the 60,000) to show that for a small number of elements to be sorted, there are other factors that influence the run time.
Note that a graph of y = x log(x) is surprisingly close to a straight line.
This is because the gradient at any point x is 1 + log(x), which is a slowly growing function of x.
In other words, it's difficult within the bounds of experimental error to distinguish between O(N) and O(N log N).
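To put a number on that (my own check, taking log as the natural logarithm so that the gradient claim above holds exactly): d/dx (x log x) = log x + 1, which over the measured range only grows from log(10000) + 1 ≈ 10.2 to log(100000) + 1 ≈ 12.5. In other words, the slope increases by only about 20% while n grows tenfold.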
The fact that the blue line is pretty straight is a reasonable verification that the algorithm is not O(N * N), but really without better statistical analysis and program control set-up, one can't say much else.
The difference between the red and blue line is down to "big O" not concerning itself with proportionality constants and other coefficients.
The time complexity is the time a program takes to execute, as a function of the problem size.
The problem size is usually expressed as the number of input elements, but some other measures can sometimes be used (e.g. algorithms on matrices of size NxN can be rated in terms of N instead of N²).
The time can effectively be measured in units of time (seconds), but is often assessed by just counting the number of atomic operations of some kind performed (e.g. the number of comparisons, of array accesses...)
In fact, for theoretical studies, the exact time is not relevant information because it is not "portable": it strongly depends on the performance of the computer used and also on implementation details.
This is why algorithmicians do not really care about exact figures, but rather about how the time varies with increasing problem sizes. This leads to the concept of asymptotic complexity, which measures the running time up to an unknown factor, and for mathematical convenience an approximation of the running time is often used to make the computations tractable.
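For reference, the "unknown factor" can be made precise with the usual definition: f(n) = O(g(n)) means there exist constants c > 0 and n0 such that f(n) <= c*g(n) for all n >= n0. The constant c absorbs machine speed and implementation details, which is exactly why exact timings are not needed for this kind of statement.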
If you study the complexity by pure benchmarking (timing), you can obtain experimental points, which you could call empirical complexity. But some statistical rigor should be applied.
(Some of the other answers do merge the concepts of complexity and asymptotic complexity, but this is not correct.)
In this discussion of complexity, you can replace time with space and study the memory footprint of the program instead.
Time complexity has nothing to do with actual time.
It's just a way that helps us to compare different algorithms - which algorithm will run faster.
For example -
In the case of sorting: bubble sort has time complexity O(n^2) and merge sort has time complexity O(n log n). So, with the help of time complexity, we can say that merge sort is much better than bubble sort for sorting things.
Big-O notation was created so that we have a generalized way of comparing the speed of different algorithms, a way which is not machine dependent.

Calculating quantiles without storing

I wrote C++ code to calculate 119 quantiles (from 10^-7 to 1 - 10^-7) of 100 million double-precision numbers.
My current implementation stores the numbers in a vector and then it sorts the vector.
Is there any way to calculate the quantiles without storing the numbers?
Thank you
ADDENDUM (sorry for my English):
Here is what I'm doing:
1) generate 20 uniformly distributed random numbers in [0, 1)
2) feed those numbers into an algorithm that outputs a random number with unknown mean and unknown variance
3) store the number from step 2
Repeat steps 1, 2 and 3 one hundred million times (so I have now collected 10^8 random numbers with unknown mean and unknown variance).
Now I sort those numbers to calculate 119 quantiles from 10^-7 to 1 - 10^-7 using the formula "R-2, SAS-5":
https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
Since the program is multi-threaded, the memory consumption is too large and I can only use 5 threads instead of 8.
This is a problem from the field of streaming algorithms (where you need to operate on a stream of data without storing each element).
There are well-known algorithms for streaming quantile computation (e.g., here), but if you are willing to use quantile approximations, it's a fairly easy problem. Simply use reservoir sampling to uniformly sample m out of n elements, and calculate the quantiles on the sample (by the method you already use: storing the m samples in a vector and sorting it). The sample size m controls the approximation's precision (see, e.g., here).
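A minimal sketch of that reservoir-sampling idea (my own illustration, not a library API; the class name and fixed seed are arbitrary, and the quantile estimate is a crude nearest-rank one rather than R-2/SAS-5):

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Algorithm R: keep a uniform random sample of at most m elements from a
// stream of unknown length, then estimate quantiles from the sorted sample.
class Reservoir {
    std::vector<double> sample_;
    std::size_t m_, seen_ = 0;
    std::mt19937_64 rng_{12345};
public:
    explicit Reservoir(std::size_t m) : m_(m) { sample_.reserve(m); }

    void push(double x) {
        ++seen_;
        if (sample_.size() < m_) { sample_.push_back(x); return; }
        std::uniform_int_distribution<std::size_t> pick(0, seen_ - 1);
        std::size_t j = pick(rng_);
        if (j < m_) sample_[j] = x;        // keep x with probability m/seen
    }

    double quantile(double p) {            // p in [0, 1]
        std::sort(sample_.begin(), sample_.end());
        std::size_t i = static_cast<std::size_t>(p * (sample_.size() - 1));
        return sample_[i];
    }
};

Note that with quantiles as extreme as 10^-7, the sample has to be very large for the tails to mean anything, so this only helps if some error in the extreme quantiles is acceptable.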
You need to know the set of numbers before you can calculate the quantiles.
This can be done by storing the numbers, but you can also make/use a multi-pass algorithm that learns a little more on each run.
There are also approximate one-pass algorithms for this problem, if some inaccuracy on the quantiles is acceptable. Here is an example: http://www.cs.umd.edu/~samir/498/manku.pdf
EDIT: I forgot: if your numbers have many duplicates, you only need to store each distinct number and how many times it appears, not every duplicate. Depending on the input data this can make a significant difference.
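A sketch of that duplicate-counting idea (my own illustration, only worthwhile if exact duplicates really are frequent, e.g. because the generator has limited precision): store a count per distinct value and read quantiles off the running cumulative count.

#include <cstdint>
#include <map>

// value -> how many times it occurred; the map stays small only if there are
// many duplicates, otherwise this degenerates to storing everything.
std::map<double, std::uint64_t> counts;
std::uint64_t total = 0;

void add(double x) { ++counts[x]; ++total; }

double quantile(double p) {                 // p in [0, 1)
    std::uint64_t target = static_cast<std::uint64_t>(p * total), cum = 0;
    for (const auto& [value, c] : counts) {
        cum += c;
        if (cum > target) return value;
    }
    return counts.rbegin()->first;          // fallback for p very close to 1
}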

Creating worst-case scenarios with Kruskal's algorithm

I have an implementation of Kruskal's algorithm in C++ (using a disjoint-set data structure). I'm trying to find possible methods of creating worst-case test cases for the total running time of the algorithm. However, I'm confused about what actually pushes the algorithm toward its worst case, and I was wondering if anyone here might know of scenarios that would really make Kruskal's algorithm struggle.
As of now, the main test I've considered that might theoretically test the bounds of Kruskal's algorithm is a test case where all weights are the same. An example would be the following:
4 4
(4, 4) 4 //(4,4) vertex and weight = 4
(4, 4) 4
(4, 4) 4
(4, 4) 4
What I end up running into is that, regardless of what I do, when I try to slow down the algorithm I just end up with no minimum spanning tree at all, and so fail to actually test the bounds of the algorithm.
To stress Kruskal's algorithm, you need a graph with as many redundant edges as possible, and at least one necessary edge that will be considered last (since Kruskal's algorithm sorts the edges by weight). Here's an example.
The edges with weight 1 are necessary, and will be taken first. The edges with weight 2 are redundant and will cause Kruskal's algorithm to waste time before getting to the edge with weight 3.
Note that the running time of Kruskal's algorithm is determined primarily by the time to sort the edges by weight. Adding additional redundant edges of medium weight will increase the sort time as well as the search time.
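A small generator along those lines (my own sketch, not the exact example above): a path of weight-1 edges that are all necessary, one weight-3 edge that is the only way to reach the last vertex, and every remaining pair joined by a redundant weight-2 edge that Kruskal's algorithm must examine and reject before it gets there.

#include <iostream>

// Print an edge list "u v w". Vertices 0..n-2 are connected by a path of
// weight-1 edges (accepted first), roughly n^2/2 weight-2 edges are all
// redundant (examined and rejected), and the single weight-3 edge is the
// only edge touching vertex n-1, so it is considered last.
int main() {
    const int n = 1000;                               // number of vertices
    for (int i = 0; i + 2 < n; ++i)
        std::cout << i << ' ' << i + 1 << " 1\n";     // spanning path over 0..n-2
    for (int i = 0; i < n - 1; ++i)
        for (int j = i + 2; j < n - 1; ++j)
            std::cout << i << ' ' << j << " 2\n";     // redundant edges
    std::cout << 0 << ' ' << n - 1 << " 3\n";         // the necessary edge seen last
}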
Kruskal's algorithm consists of two phases: sorting the edges and then performing union-find. If you implement the second phase using a disjoint-set forest with the path-compression and union-by-rank heuristics, the sorting will be much slower than the second phase. Thus, to create a worst-case scenario for Kruskal, you should simply generate a worst-case scenario for the sorting algorithm you are using. If you use the built-in sort, it has an optimization that will actually make it work much faster on an already sorted array.

Missing number(s) Interview Question Redux

The common interview problem of determining the missing value in a range from 1 to N has been done a thousand times over. Variations include 2 missing values up to K missing values.
Example problem: Range [1,10] (1 2 4 5 7 8 9 10) = {3,6}
Here is an example of the various solutions:
Easy interview question got harder: given numbers 1..100, find the missing number(s)
My question is: seeing as the simple case of one missing value has O(n) complexity, and the complexity of the larger cases converges at roughly something larger than O(n log n):
Couldn't it just be easier to answer the question by sorting the range (say, with mergesort) and iterating over it to observe the missing elements?
This solution should take no more than O(n log n) and is capable of solving the problem for ranges other than 1-to-N, such as 10-to-1000 or -100 to +100, etc.
Is there any reason to believe that the given solutions in the above SO link will be better than the sorting based solution for larger number of missing values?
Note: it seems a lot of the common solutions to this problem assume a purely number-theoretic approach. If one is asked such a question in a software engineering interview, wouldn't it be prudent to use a more computer-science/algorithmic approach, assuming that approach is on par with the number-theoretic solution's complexity?
More related links:
https://mathoverflow.net/questions/25374/duplicate-detection-problem
How to tell if an array is a permutation in O(n)?
You are only specifying the time complexity, but the space complexity is also important to consider.
The problem complexity can be specified in term of N (the length of the range) and K (the number of missing elements).
In the question you link, the solution using equations is O(K) in space (or perhaps a bit more?), as you need one equation per unknown value.
There is also the preservation point: may you alter the list of known elements? In a number of cases this is undesirable, in which case any solution involving reordering the elements, or consuming them, must first make a copy, which is O(N-K) in space.
I cannot see how to go faster than a linear solution: you need to read all known elements (N-K) and output all unknown elements (K). Therefore you cannot do better than O(N) in time.
Let us break down the solutions
Destroying, O(N) space, O(N log N) time: in-place sort
Preserving, O(K) space ?, O(N log N) time: equation system
Preserving, O(N) space, O(N) time: counting sort
Personally, though I find the equation system solution clever, I would probably use either of the sorting solutions. Let's face it: they are much simpler to code, especially the counting sort one!
And as far as time goes, in a real execution, I think the "counting sort" would beat all other solutions hands down.
Note: the counting sort does not require the range to be [0, X), any range will do, as any finite range can be transposed to the [0, X) form by a simple translation.
EDIT:
Changed the in-place sort to O(N) space: one needs to have all the elements available in order to sort them.
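For concreteness, a sketch of the counting-sort route with the range translation mentioned above (my own code, using the example range from the question):

#include <iostream>
#include <vector>

// Report the values of [lo, hi] absent from the input: O(N) time and O(N)
// space, where N = hi - lo + 1, after translating [lo, hi] to [0, N).
std::vector<int> missing(const std::vector<int>& input, int lo, int hi) {
    std::vector<bool> seen(hi - lo + 1, false);
    for (int x : input) seen[x - lo] = true;
    std::vector<int> out;
    for (int i = 0; i < static_cast<int>(seen.size()); ++i)
        if (!seen[i]) out.push_back(lo + i);
    return out;
}

int main() {
    for (int m : missing({1, 2, 4, 5, 7, 8, 9, 10}, 1, 10))
        std::cout << m << ' ';               // prints 3 6
}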
Having had some time to think about the problem, I also have another solution to propose. As noted, when N grows (dramatically) the space required might explode. However, if K is small, then we could change our representation of the list, using intervals:
{4, 5, 3, 1, 7}
can be represented as
[1,1] U [3,5] U [7,7]
In the average case, maintaining a sorted list of intervals is much less costly than maintaining a sorted list of elements, and it's as easy to deduce the missing numbers too.
The time complexity is easy: O(N log N), after all it's basically an insertion sort.
Of course what's really interesting is that there is no need to actually store the list; you can feed the algorithm with a stream instead.
On the other hand, I have quite a hard time figuring out the average space complexity. The "final" space occupied is O(K) (at most K+1 intervals), but during construction there may be many more intervals, since we introduce the elements in no particular order.
The worst case is easy enough: N/2 intervals (think odd vs. even numbers). I cannot, however, figure out the average case. My gut feeling tells me it should be better than O(N), but I am not that confident.
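A sketch of that interval representation (my own code; an inclusive-interval map keyed by start, with the missing numbers read off the gaps at the end):

#include <iostream>
#include <iterator>
#include <map>
#include <vector>

// Maintain the numbers seen so far as disjoint inclusive intervals
// (start -> end); the missing numbers of [lo, hi] are the gaps.
class IntervalSet {
    std::map<int, int> iv_;
public:
    void insert(int x) {
        auto next = iv_.upper_bound(x);               // first interval starting after x
        if (next != iv_.begin()) {
            auto prev = std::prev(next);
            if (prev->second >= x) return;            // already covered
            if (prev->second == x - 1) {              // extend the previous interval
                prev->second = x;
                if (next != iv_.end() && next->first == x + 1) {
                    prev->second = next->second;      // and merge with the next one
                    iv_.erase(next);
                }
                return;
            }
        }
        if (next != iv_.end() && next->first == x + 1) {
            int end = next->second;                   // extend the next interval downward
            iv_.erase(next);
            iv_[x] = end;
            return;
        }
        iv_[x] = x;                                   // isolated new interval
    }

    std::vector<int> missing(int lo, int hi) const {
        std::vector<int> out;
        int cursor = lo;
        for (const auto& [s, e] : iv_) {
            for (int v = cursor; v < s && v <= hi; ++v) out.push_back(v);
            if (e + 1 > cursor) cursor = e + 1;
        }
        for (int v = cursor; v <= hi; ++v) out.push_back(v);
        return out;
    }
};

int main() {
    IntervalSet s;
    for (int x : {4, 5, 3, 1, 7}) s.insert(x);              // the example from above
    for (int m : s.missing(1, 7)) std::cout << m << ' ';    // prints 2 6
}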
Whether the given solution is theoretically better than the sorting one depends on N and K. While your solution has complexity O(N log N), the given solution is O(N*K). I think that the given solution is (like the sorting solution) able to handle any range [A, B] simply by transforming the range [A, B] to [1, N].
What about this?
create your own set containing all the numbers
remove the given set of numbers from your set (no need to sort)
What's left in your set are the missing numbers.
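A sketch of that idea (my own code, using the example range from the question; no sorting involved, at the cost of O(N) space for the set):

#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> all;
    for (int v = 1; v <= 10; ++v) all.insert(v);            // your own set: the full range
    for (int v : {1, 2, 4, 5, 7, 8, 9, 10}) all.erase(v);   // remove the given numbers
    for (int m : all) std::cout << m << ' ';                // 3 and 6, in unspecified order
}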
My question is that seeing as the [...] cases converge at roughly something larger than O(n log n) [...]
In 2011 (after you posted this question) Caf posted a simple answer that solves the problem in O(n) time and O(k) space [where the array size is n - k].
Importantly, unlike in other solutions, Caf's answer has no hidden memory requirements (using bit arrays, adding numbers to elements, or multiplying elements by -1 would all require O(log(n)) space).
Note: The question here (and the original question) didn't ask about the streaming version of the problem, and the answer here doesn't handle that case.
Regarding the other answers: I agree that many of the proposed "solutions" to this problem have dubious complexity claims, and if their time complexities aren't better in some way than either:
count sort (O(n) time and space)
compare (heap) sort (O(n*log(n)) time, O(1) space)
...then you may as well just solve the problem by sorting.
However, we can get better complexities (and more importantly, genuinely faster solutions):
Because the numbers are taken from a small, finite range, they can be 'sorted' in linear time.
All we do is initialize an array of 100 booleans, set the boolean corresponding to each number in the input, and then step through the array reporting the unset booleans.
If there are N elements in total, where each number x satisfies 1 <= x <= N, then we can solve this with O(n log n) time complexity and O(1) extra space.
First sort the array using quicksort or mergesort.
Scan through the sorted array; if the difference between the previously scanned number a and the current number b is equal to 2 (b - a = 2), then the missing number is a + 1. This can be extended to the condition b - a > 2, where all of a+1, ..., b-1 are missing.
Time complexity is O(n log n) + O(n), which is O(n log n).
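A sketch of that sort-and-scan approach (my own code; it reports every value in each gap, not just single missing numbers):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> a = {10, 7, 1, 9, 4, 2, 8, 5};   // values expected to be 1..N
    const int N = 10;
    std::sort(a.begin(), a.end());                    // O(n log n)
    int prev = 0;                                     // sentinel just below the range
    for (int b : a) {                                 // O(n) scan
        for (int m = prev + 1; m < b; ++m)            // everything skipped over is missing
            std::cout << m << ' ';
        prev = b;
    }
    for (int m = prev + 1; m <= N; ++m)               // gap at the top of the range, if any
        std::cout << m << ' ';
}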
I already answered it HERE
You can also create an array of booleans of size last_element_in_the_existing_array + 1.
In a for loop, mark as true all the elements that are present in the existing array.
In another for loop, print the indices of the elements that contain false, i.e. the missing ones.
Time Complexity: O(last_element_in_the_existing_array)
Space Complexity: O(array.length)
If the range is given to you well ahead of time (in this case the range is [1,10]), you can XOR the numbers of your range together with the numbers given to you. Since XOR is commutative, every number that is present cancels itself out, and you are left with the XOR of the missing numbers, here 3 XOR 6. For a single missing number that is the answer directly; for two missing numbers you can recover both by splitting the candidates on any set bit of that XOR.
(1 2 3 4 5 6 7 8 9 10) XOR (1 2 4 5 7 8 9 10) = {3,6}
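A sketch of how that plays out for two missing values (my own code; the XOR of everything collapses to 3 XOR 6, and splitting on one of its set bits separates the two):

#include <iostream>
#include <vector>

int main() {
    const int N = 10;
    const std::vector<int> given = {1, 2, 4, 5, 7, 8, 9, 10};

    int x = 0;                                    // ends up as a ^ b, the two missing values
    for (int v = 1; v <= N; ++v) x ^= v;
    for (int v : given) x ^= v;

    int bit = x & -x;                             // a and b differ in this bit
    int a = 0, b = 0;
    for (int v = 1; v <= N; ++v) ((v & bit) ? a : b) ^= v;
    for (int v : given) ((v & bit) ? a : b) ^= v;

    std::cout << a << ' ' << b << '\n';           // prints 3 6 (order depends on the chosen bit)
}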

External sorting of ints with O(N log N) reads and O(N) writes

I'm interested in which algorithm I should use to meet the requirements of external sorting of ints with O(N log N) reads and O(N) writes.
If you're after an algorithm for that type of sorting (where the data can't all fit into core at once), my solution comes from the very earliest days of the "revolution" when top-end machines had less memory than most modern-day calculators.
I haven't worked out the big-O properties but I think it would be O(n) reads, O(n log n) sort phase (depends on the sort method chosen) and O(n) writes.
Let's say your data set has one million elements and you can only fit 100,000 in memory at a time. Here's what I'd do:
read in the first 100,000, sort them and write that sorted list back out.
do this for each group of 100,000.
run a merge operation on the 10 groups.
In other words, once your 10 groups are sorted within the group, grab the first entry from each group.
Then write the lowest of those 10 (which is the lowest of the whole million) to the output file and read the next one from that group in its place.
Then just continue selecting the lowest of the 10, writing it out and replacing it from the same group. In that way, the final output is the entire sorted list of a million entries.
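A sketch of that merge step (my own code, not the answerer's; the run file names are placeholders for whatever the chunk-sorting pass wrote out, and a min-heap stands in for the linear scan over the ten current heads, performing the same selection):

#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Merge k sorted run files into one output file, holding only one value
// per run in memory at any time.
int main() {
    const int k = 10;
    std::vector<std::ifstream> runs;
    runs.reserve(k);
    for (int i = 0; i < k; ++i)
        runs.emplace_back("run" + std::to_string(i) + ".txt");   // placeholder names

    using Item = std::pair<int, int>;                            // (value, run index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;

    int v;
    for (int i = 0; i < k; ++i)
        if (runs[i] >> v) heap.emplace(v, i);                    // prime with each run's head

    std::ofstream out("sorted.txt");
    while (!heap.empty()) {
        auto [value, i] = heap.top();                            // smallest current head
        heap.pop();
        out << value << '\n';
        if (runs[i] >> v) heap.emplace(v, i);                    // refill from the same run
    }
}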
Check out the external merge sort algorithm.
Try this page: Sorting Algorithms. Besides showing nice animations of several algorithms, it explains how they work and their complexity.