Big O Notation for string matching algo - c++

What would the big O notation of the function foo be?
int foo(char *s1, char *s2)
{
    int c = 0, s, p, found;
    for (s = 0; s1[s] != '\0'; s++)
    {
        for (p = 0, found = 0; s2[p] != '\0'; p++)
        {
            if (s2[p] == s1[s])
            {
                found = 1;
                break;
            }
        }
        if (!found) c++;
    }
    return c;
}
What is the efficiency of the function foo?
a) O(n!)
b) O(n^2)
c) O(n log₂ n)
d) O(n)
I would have said O(MN)...?

It is O(n²) where n = max(length(s1),length(s2)) (which can be determined in less than quadratic time - see below). Let's take a look at a textbook definition:
f(n) ∈ O(g(n)) if a positive real number c and positive integer N exist such that f(n) <= c g(n) for all n >= N
By this definition we see that n represents a number - in this case that number is the length of the string passed in. However, there is an apparent discrepancy, since this definition provides only for a single variable function f(n) and here we clearly pass in 2 strings with independent lengths. So we search for a multivariable definition for Big O. However, as demonstrated by Howell in "On Asymptotic Notation with Multiple Variables":
"it is impossible to define big-O notation for multi-variable functions in a way that implies all of these [commonly-assumed] properties."
There is actually a formal definition for Big O with multiple variables; however, it requires that extra constraints beyond single-variable Big O be met, and it is beyond the scope of most (if not all) algorithms courses. For typical algorithm analysis we can effectively reduce our function to a single variable by bounding all variables by a limiting variable n. In this case the variables (specifically, length(s1) and length(s2)) are clearly independent, but it is possible to bound them:
Method 1
Let x1 = length(s1)
Let x2 = length(s2)
The worst case scenario for this function occurs when there are no matches, therefore we perform x1 * x2 iterations.
Because multiplication is commutative, the worst case scenario foo(s1,s2) == the worst case scenario of foo(s2,s1). We can therefore assume, without loss of generality, that x1 >= x2. (This is because, if x1 < x2 we could get the same result by passing the arguments in the reverse order).
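For a concrete sense of this worst case, here is a minimal driver around the foo above (the example inputs are made up for illustration):
char s1[] = "aaaa";   // x1 = 4
char s2[] = "xyz";    // x2 = 3
int c = foo(s1, s2);  // no character of s1 occurs in s2, so the inner loop
                      // always scans all of s2: 4 * 3 = 12 comparisons, c == 4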
Method 2 (in case you don't like the first method)
For the worst case scenario (in which s1 and s2 contain no common characters), we can determine length(s1) and length(s2) prior to iterating through the loops (in .NET and Java, determining the length of a string is O(1) - but in this case it is O(n)), assigning the greater to x1 and the lesser to x2. Here it is clear that x1 >= x2.
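As a sketch, Method 2 might look something like this (foo2 is a hypothetical variant that mirrors the original loops; the strlen calls are the extra calculations discussed next):
#include <cstring>

// Hypothetical variant of foo: measure both strings first (two O(n) scans),
// then run the same nested loops with explicit bounds.
int foo2(const char *s1, const char *s2)
{
    std::size_t len1 = std::strlen(s1);   // O(n)
    std::size_t len2 = std::strlen(s2);   // O(n)
    int c = 0;
    for (std::size_t s = 0; s < len1; s++)          // at most n iterations
    {
        int found = 0;
        for (std::size_t p = 0; p < len2; p++)      // at most n iterations each
        {
            if (s2[p] == s1[s]) { found = 1; break; }
        }
        if (!found) c++;
    }
    return c;   // worst case: n*n + 2n steps
}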
For this scenario, we will see that the extra calculations to determine x1 and x2 make this O(n² + 2n). We use the following simplification rule, which can be found here, to simplify to O(n²):
If f(x) is a sum of several terms, the one with the largest growth rate is kept, and all others omitted.
Conclusion
For n = x1 (our limiting variable), such that x1 >= x2, the worst case scenario is x1 = x2.
Therefore: f(x1) ∈ O(n²)
Extra Hint
For all homework problems posted to SO related to Big O notation, if the answer is not one of:
O(1)
O(log log n)
O(log n)
O(n^c), 0<c<1
O(n)
O(n log n) = O(log n!)
O(n^2)
O(n^c)
O(c^n)
O(n!)
Then the question is probably better off being posted to https://math.stackexchange.com/

In big-O notation, we always have to define what the occurring variables mean. O(n) doesn't mean anything unless we define what n is. Often, we can omit this information because it is clear from context. For example, if we say that some sorting algorithm is O(n log(n)), n always denotes the number of items to sort, so we don't have to state this every time.
Another important thing about big-O notation is that it only gives an upper limit -- every algorithm in O(n) is also in O(n^2). The notation is often used as meaning "the algorithm has the exact asymptotic complexity given by the expression (up to a constant factor)", but its actual definition is "the complexity of the algorithm is bounded by the given expression (up to a constant factor)".
In the example you gave, you took m and n to be the respective lengths of the two strings. With this definition, the algorithm is indeed O(m n). If we define n to be the length of the longer of the two strings though, we can also write this as O(n^2) -- this is also an upper limit for the complexity of the algorithm. And with the same definition of n, the algorithm is also O(n!), but not O(n) or O(n log(n)).

O(n^2)
The relevant part of the function, in terms of complexity, is the nested loops. The maximum number of iterations is the length of s1 times the length of s2, both of which are linear factors, so the worst-case computing time is O(n^2), i.e. the square of a linear factor. As Ethan said, O(mn) and O(n^2) are effectively the same thing.

Think of it this way:
There are two inputs. If the function simply returned, then its performance is unrelated to the arguments. This would be O(1).
If the function looped over one string, then the performance is linearly related to the length of that string. Therefore O(N).
But the function has a loop within a loop. The performance is related to the length of s1 and the length of s2. Multiply those lengths together and you get the number of loop iterations. It's not linear any more; it follows a curve. This is O(N^2).

Related

How to calculate complexity if a code contains multiple n complexity loops?

I am a bit confused on the topic of calculating complexity.
I know about Big O and also how to calculate the complexity of loops (nested also).
Suppose I have a program with 3 loops running from 1 to n
for (int i = 0; i < n; i++)
{
    cout << i;
}
Now if I ran my CPP code having 3 for loops, will it take 3*n time?
Will the CPP compiler run all the 3 loops at the same time or will do it one after another?
I am very confused on this topic. Please help!
Now if I ran my CPP code having 3 for loops, will it take 3*n time?
Yes, assuming that the time of each loop iteration is the same, but in Big O notation O(3*n) == O(n), so the complexity is still linear.
Will the CPP compiler run all the 3 loops at the same time or will do it one after another?
Implicit concurrency requires a compiler to be 100% sure that parallelizing code will not change the outcome. It can be (and it is, see comments) done for simple operations, but cout << i is unlikely to be parallelized. It can be optimized in different ways, however; e.g. if n is known at compile time, the compiler could generate the whole string in one go and change the loop into cout << "123456...";
Also, time complexity and concurrency are rather unrelated topics. Code executed on 20 threads will have the same complexity as code executed on one thread, it will just be faster (or not).
Now if I ran my CPP code having 3 for loops, will it take 3*n time?
Run a thousand loops and still it would be O(n), since while calculating the upper bound time complexity of a function any constant is neglected. So O(n*m) will always be O(n) if m doesn't depend on input size.
Also, the compiler won't run them at the same time, but sequentially one after the other (unless you use multi-threading, of course). But even then, 3, 10 or 1000 loops one after another will probably be considered O(n) as per the definition, as long as the number of times you loop is not dependent on the input size.
How to calculate complexity if a code contains multiple n complexity loops?
To understand Big-O notation and asymptotic complexity, it can be useful to resort at least to semi-formal notation.
Consider the problem of finding an upper bound on the asymptotic time complexity of a function f(n) based on the growth of n.
To our help, let's loosely define a function or algorithm f being in O(g(n)) (to be picky, O(g(n)) being a set of functions, hence f ∈ O(...), rather than the commonly misused f(n) ∈ O(...)) as follows:
If a function f is in O(g(n)), then c · g(n) is an upper bound on f(n), for some non-negative constant c such that f(n) ≤ c · g(n) holds, for sufficiently large n (i.e., n ≥ n0 for some constant n0).
Hence, to show that f ∈ O(g(n)), we need to find a set of (non-negative) constants (c, n0) that fulfils
f(n) ≤ c · g(n), for all n ≥ n0, (+)
Let's consider your actual problem
void foo(int n) {
    for (int i = 0; i < n; ++i) { std::cout << i << "\n"; }
    for (int i = 0; i < n; ++i) { std::cout << i << "\n"; }
    for (int i = 0; i < n; ++i) { std::cout << i << "\n"; }
}
and for analyzing the asymptotic behaviour of foo based on the growth of n, consider std::cout << i << "\n"; as our basic operation. Thus, based on this definition, foo contains 3 * n basic operations, and we may consider foo mathematically as
f(n) = 3 * n.
Now, we need to find a g(n) and some set of constants c and n0 such that (+) holds. For this particular analysis this is nearly trivial; insert f(n) as above in (+) and let g(n) = n:
3 * n ≤ c · g(n), for all n ≥ n0, [let g(n) = n]
3 * n ≤ c · n, for all n ≥ n0, [choose c = 3]
3 * n ≤ 3 · n, for all n ≥ n0.
The latter holds for any valid n, and we may arbitrarily choose n0 = 0. Thus, as per our definition above of a function f being in O(g(n)), we have shown that f is in O(n).
It is apparent that even if we repeat the loop in foo any constant number of times (a multiple not dependent on n itself), we can always find a pair of constants c and n0 that fulfil (+) for g(n) = n, thus showing that the function f describing the number of basic operations in foo based on n is upper bounded by linear growth.
Now if I ran my CPP code having 3 for loops, will it take 3*n time?
However, it is essential to understand that Big-O notation describes the upper bound on the asymptotic behaviour of a mathematically described algorithm, or e.g. of a programmatically implemented function that, based on the definition of a basic operation, can be described as the former. It does not, however, give an accurate description of what runtime you may expect from different variations of how a function is implemented. Cache locality, parallelism/vectorization, compiler optimizations and hardware intrinsics, and inaccuracy in describing the basic operation are just a few of many factors that make asymptotic complexity disjoint from actual runtime. The linked list data structure is a good example of one where asymptotic analysis is not likely to give a good view of runtime performance (as loss of cache locality, the actual size of the lists and so on will likely have a larger effect).
For the actual runtime of your algorithms, in case you are hitting a bottleneck, actually measuring on target hardware with a production-representative compiler and optimization flags is key.
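As a bare-bones illustration (a minimal sketch, not a rigorous benchmark; the helper name and usage are made up):
#include <chrono>
#include <iostream>

// Minimal wall-clock timing sketch: run the work once, print elapsed milliseconds.
template <typename F>
void timeIt(const char *label, F &&work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> ms = stop - start;
    std::cout << label << ": " << ms.count() << " ms\n";
}

// Usage, assuming the foo(int n) from above is in scope:
//   timeIt("3 loops, n = 1000000", [] { foo(1000000); });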

Multiple linear operations' impact on overall function worst-case complexity?

This is perhaps somewhat of a simple question, but I'm going to ask anyway folks:
I've written the below function:
#include <vector>

std::vector<int> V = {1, 2, 3, 4, 5};

int myFunction()
{
    for (int i = V.size(); i--; ) { /* do stuff */ }
    for (int e = V.size(); e--; ) { /* do stuff */ }
    return 0;
}
This needs to have a worst-case time complexity of O(n) and a worst-case space complexity of O(1).
Does having two linear operations (for-loops) change the time complexity to something other than O(n)?
No, it does not. O(N) means something like a·N + b plus something weaker than linear in N.
Even something like T(N) = 5N + 100 + log(N) is considered O(N).
By "something weaker than linear in N", I mean any function R(N) that satisfies:
lim R(N)/N = 0 as N → infinity   // use L'Hôpital's rule for solving these kinds of limits
So a function in O(N) can be written as:
T(N) = a·N + b + R(N)
Side note: complexity does not equal performance. Although (N+N) is still O(N), that does not mean it is as fast as (N). Performance, in its most basic form, is about the number of cycles you need to do something, not about the theoretical complexity.
However, the two should be related, at least as N becomes very large (approaching infinity).

order of an operation when upper bound is fixed

I recently had an interview and was asked to find the number of bits set in a supplied integer. I had something like this:
#include <iostream>
using namespace std;
int givemCountOnes(unsigned int X) {
    int count = 0;
    while (X != 0) {
        if (X & 1)
            count++;
        X = X >> 1;
    }
    return count;
}

int main() {
    cout << givemCountOnes(4);
    return 0;
}
I know there are better approaches but that is not the question here.
Question is, What is the complexity of this program?
Since it iterates once per bit of the input, people say this is O(n), where n is the number of bits in the input.
However, I feel that since the upper bound is sizeof(unsigned int), i.e. say 64 bits, I should say the order is O(1).
Am I wrong?
The complexity is O(N). The complexity rises linearly with the size of the type used (unsigned int).
The upper bound does not matter as it can be extended any time in the future. It also does not matter because there is always an upper bound (memory size, number of atoms in the universe) and then everything could be considered O(1).
I will just add a better solution to the above problem.
Use the following step in the loop:
x = x & (x-1);
This removes the rightmost ON bit, one bit per iteration.
So your loop will run at most once per ON bit, terminating when the number reaches 0.
Hence the complexity improves from O(number of bits in int) to O(number of on bits).
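Put together, that improved loop might look like this (a small sketch; the function name is made up):
// Brian Kernighan's trick: each x &= (x - 1) clears the lowest set bit,
// so the loop runs once per set bit rather than once per bit position.
int countOnesFast(unsigned int x)
{
    int count = 0;
    while (x != 0) {
        x &= (x - 1);
        ++count;
    }
    return count;
}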
The O notation is used to describe how the cost changes between different values of n. In this case, n would be the number of bits, as (in your case) the number of bits will change the (relative) time it takes to perform the calculation. So O(n) is correct - a one-bit integer will take 1 unit of time, a 32-bit integer will take 32 units of time, and a 64-bit integer will take 64 units of time.
Actually, your algorithm does not depend on the actual number of bits in the number, but on the position of the highest bit set in the number, but that's a different matter. However, since we typically talk about O as the "worst case", it's still O(n), where n is the number of bits in the integer.
And I can't really think of any method that is sufficiently better than that in terms of O - I can think of methods that improve the number of iterations in the loop (e.g. using a 256-entry table and dealing with 8 bits at a time), but it's still "bigger data -> longer time". O(n), O(n/2) and O(n/8) are all the same - it's just that the overall time is about 1/8 as long in the last case as in the first.
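For completeness, a rough sketch of that table idea (the names and details here are illustrative assumptions, not a reference implementation):
#include <array>

// Precompute the bit count of every byte value once, then look up one byte
// at a time instead of testing individual bits. Still O(n) in the number of
// bits, just with roughly n/8 loop iterations.
std::array<unsigned char, 256> makeTable()
{
    std::array<unsigned char, 256> t{};
    for (int i = 0; i < 256; ++i)
        t[i] = static_cast<unsigned char>(t[i >> 1] + (i & 1));
    return t;
}

int countOnesTable(unsigned int x)
{
    static const std::array<unsigned char, 256> table = makeTable();
    int count = 0;
    for (unsigned byte = 0; byte < sizeof(x); ++byte) {
        count += table[x & 0xFFu];   // one 8-bit lookup per byte
        x >>= 8;
    }
    return count;
}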
Big O notation describes the count of algorithm steps in the worst-case scenario, which in this case is when the highest bit of the input is set. So there will be n iterations/steps when you pass an n-bit number as input.
Imagine a similar algorithm which counts the 1's in a list. Its complexity is O(n), where n is the list length. By your reasoning, if you always pass fixed-size lists as input, the algorithm's complexity would become O(1), which is incorrect.
However, if you fix the bit length in the algorithm, i.e. something like for (int i = 0; i < 64; ++i) ..., then it will have O(1) complexity, since it does an O(1) operation 64 times and you can ignore the constant. In general, O(c*n) is O(n) and O(c) is O(1), where c is a constant.
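A sketch of such a fixed-length variant (hypothetical code, just to make the point concrete):
// Always 64 iterations regardless of the value passed in, so the step
// count is a constant -> O(1) under that view.
int countOnesFixed(unsigned long long X)
{
    int count = 0;
    for (int i = 0; i < 64; ++i)
        count += static_cast<int>((X >> i) & 1ULL);
    return count;
}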
Hope all these examples helped. BTW, there is O(1) solution for this, I'll post it when I remember :)
There's one thing that should be made clear: the complexity of the operations on your integer. It is not obvious in this example, because you work on int, which is the natural word size on your machine, so each operation appears to cost just 1.
But O-notation is about large amounts of data and large tasks; say you have an n-bit integer, where n is about 4096 or so. In that case addition, subtraction and shift are at least O(n) operations, so your algorithm applied to such an integer would have O(n²) complexity (n operations, each of O(n) complexity).
A direct counting algorithm that avoids shifting the whole number (assuming a single bit test is O(1)) gives O(n log(n)) complexity (it involves up to n additions on a log(n)-sized counter).
But for fixed-length data (which is what C's int is), big O analysis is rather meaningless, because it is based on input data of variable length - indeed, of virtually any length up to infinity.

What is the Big Oh Efficiency of Program with Finite For Loop?

What would be the efficiency of the following program? It is a for loop which runs a finite number of times.
for (int i = 0; i < 10; i++)
{
    //do something here, no more loops though.
}
So, what should the efficiency be: O(1) or O(n)?
That entirely depends on what is in the for loop. Also, computational complexity is normally measured in terms of the size n of the input, and I can't see anything in your example that models or represents or encodes directly or indirectly the size of the input. There is just the constant 10.
Besides, although sometimes the analysis of computational complexity may give unexpected, surprising results, the correct term is not "Big Oh", but rather Big-O.
You can only talk about the complexity with respect to some specific input to the calculation. If you are looping ten times because there are ten "somethings" that you need to do work for, then your complexity is O(N) with respect to those somethings. If you just need to loop 10 times regardless of the number of somethings - and the processing time inside the loop doesn't change with the number of somethings - then your complexity with respect to them is O(1). If there's no "something" for which the order is greater than 1, then it's fair to describe the loop as O(1).
bit of further rambling discussion...
O(N) indicates the time taken for the work to complete can be reasonably approximated by some constant amount of time plus some function of N - the number of somethings in the input - for huge values of N:
O(N) indicates the time is c + x·N, where c is a fixed overhead and x is the per-something processing time,
O(log₂N) indicates the time is c + x·log₂N,
O(N²) indicates the time is c + x·N²,
O(N!) indicates the time is c + x·N!,
O(N^N) indicates the time is c + x·N^N,
etc..
Again, in your example there's no mention of the number of inputs, and the loop iteration count is fixed. I can see how it's tempting to say it's O(1) even if there are 10 input "somethings", but consider: if you have a function capable of processing an arbitrary number of inputs, then decide you'll only use it in your application with exactly 10 inputs and hard-code that, you clearly haven't changed the performance characteristics of the function - you've just locked in a single point on the time-for-N-inputs curve - and any big-O complexity that was valid before the hardcoding must still be valid afterwards. It's less meaningful and useful though, as N of 10 is a small amount, and unless you've got a horrific big-O complexity like O(N^N) the constants c and x take on a lot more importance in describing the overall performance than they would for huge values of N (where changes in the big-O notation generally have much more impact on performance than changing c or even x - which is of course the whole point of having big-O analysis).
Sure, O(1), because nothing here depends on n.
EDIT:
Let the loop body contain some complex action with complexity O(P(n)) in Big O terms.
If we have a constant number C of iterations, the complexity of the loop will be O(C * P(n)) = O(P(n)).
Otherwise, let the number of iterations be Q(n), depending on n. That makes the complexity of the loop O(Q(n) * P(n)).
I'm just trying to say that when the number of iterations is constant, it does not change the complexity of the whole loop.
n in Big O notation denotes the input size. We can't tell what the complexity is, because we don't know what is happening inside the for loop. For example, maybe there are recursive calls depending on the input size? In this example the overall complexity is O(n):
void f(int n) // input size = n
{
    for (int i = 0; i < 10; i++)
    {
        //do something here, no more loops though.
        g(n); // O(n)
    }
}

void g(int n)
{
    if (n > 0)
    {
        g(n - 1);
    }
}

What is the complexity of the below program?

What is the complexity of the below program? I think it must be O(n), since there is a for loop that runs for n times.
It is a program to reverse the bits in a given integer.
unsigned int reverseBits(unsigned int num)
{
    unsigned int NO_OF_BITS = sizeof(num) * 8;
    unsigned int reverse_num = 0;
    int i;
    for (i = 0; i < NO_OF_BITS; i++)
    {
        if ((num & (1 << i)))
            reverse_num |= 1 << ((NO_OF_BITS - 1) - i);
    }
    return reverse_num;
}
What is the complexity of the above program and how? Someone said that the actual complexity is O(log n), but I can't see why.
Considering your above program, the complexity is O(1) because 8 * sizeof(unsigned int) is a constant. Your program will always run in constant time.
However if n is bound to NO_OF_BITS and you make that number an algorithm parameter (which is not the case), then the complexity will be O(n).
Note that with n bits the maximal value possible for num is 2^n, so if you instead want to express the complexity as a function of N, the maximal value allowed for num, the complexity is O(log₂(N)), i.e. O(log(N)).
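To make that bits-versus-value relationship concrete, a tiny (hypothetical) helper like this counts how many bits a value needs; the loop runs roughly log₂(N) times:
// The number of bits needed to represent a value N grows as log2(N):
// one iteration per bit of N.
unsigned bitsNeeded(unsigned long long N)
{
    unsigned bits = 0;
    while (N > 0) {
        ++bits;
        N >>= 1;
    }
    return bits;   // e.g. bitsNeeded(255) == 8, bitsNeeded(256) == 9
}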
O-notation describes how the time or space requirements for an algorithm depend on the size of the input (denoted n), in the limit as n becomes very large. The input size is the number of bits required to represent the input, not the range of values that those bits can represent.
(Formally, describing an algorithm with running time t(n) as O(f(n)) means that there is some size N and some constant C for which t(n) <= C*f(n) for all n > N).
This algorithm does a fixed amount of work for each input bit, so the time complexity is O(n). It uses a working space, reverse_num, of the same size as the input (plus some asymptotically smaller variables), so the space complexity is also O(n).
This particular implementation imposes a limit on the input size, and therefore a fixed upper bound on the time and space requirements. This does not mean that the algorithm is O(1), as some answers say. O-notation describes the algorithm, not any particular implementation, and is meaningless if you place an upper bound on the input size.
If n == num, the complexity is a constant O(1), as the loop always runs a fixed number of times. The space complexity is also O(1), as it does not depend on the input.
If n is the input number, then NO_OF_BITS is O(log n) (think about it: to represent a binary number n, you need about log2(n) bits).
EDIT: Let me clarify, in the light of other responses and comments.
First, let n be the input number (num). It's important to clarify this because if we consider n to be NO_OF_BITS instead, we get a different answer!
The algorithm is conceptually O(log n). We need to reverse the bits of n. There are O(log n) bits needed to represent the number n, and reversing the bits involves a constant amount of work for each bit; hence the complexity is O(log n).
Now, in reality, built-in types in C cannot represent integers of arbitrary size. In particular, this implementation uses unsigned int to represent the input, and that type is limited to a fixed number of bits (32 on most systems). Moreover, rather than just going through as many bits as necessary (from the lowest-order bit up to the highest-order bit that is 1), this implementation chooses to go through all 32 bits. Since 32 is a constant, this implementation technically runs in O(1) time.
Nonetheless, the algorithm is conceptually O(log n), in the sense that if the input were 2^5, 5 iterations would be sufficient, if the input were 2^10, 10 iterations would be sufficient, and if there were no limit on the range of numbers an unsigned int could represent and the input were 2^1000, then 1000 iterations would be necessary.
Under no circumstances is this algorithm O(n) (unless we define n to be NO_OF_BITS, in which case it is).
You need to be clear what n is. If n is num then of course your code is O(log n), as NO_OF_BITS ≈ log₂(n).
Also, as you are dealing with fixed size values, the whole thing is O(1). Of course, if you are viewing this as a more general concept and are likely to extend it, then feel free to think of it as O(log n) in the more general context where you intend to extend it beyond fixed bit numbers.