I am solving a question from LeetCode.com:
Given an array of non-negative integers, you are initially positioned at the first index of the array.
Each element in the array represents your maximum jump length at that position.
Determine if you are able to reach the last index.
For example:
A = [2,3,1,1,4], return true.
A = [3,2,1,0,4], return false.
One of the most voted solutions (here) says that the following is using a greedy approach:
bool canJump(int A[], int n) {
    int last=n-1,i,j;
    for(i=n-2;i>=0;i--){
        if(i+A[i]>=last)last=i;
    }
    return last<=0;
}
I have two questions:
What is the intuition behind using a greedy algorithm for this?
How is the above solution a greedy algorithm?
I thought this would be solvable by dynamic programming. I understand that some problems solvable by DP can also be solved greedily, but what was the intuition behind this particular one that made a greedy approach the more natural choice?
This SO question highlights this difference to some extent. I understand this might be asking a bit much, but if possible, could someone please answer this question in this context? I would highly appreciate that.
Thank you.
Edit: I think one of the reasons of my confusion is over the input [3,3,1,0,4]. As per the greedy paradigm, when i=0 wouldn't we take a jump of size 3 (A[0]) in order to greedily reach the output? But doing this would in fact be incorrect.
According to Wikipedia:
A greedy algorithm is an algorithmic paradigm that follows the problem solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum.
Here, I want to draw your attention to the key phrase, locally optimal choice at each stage which makes the algorithm paradigm greedy.
Q1. What is the intuition behind using a greedy algorithm for this?
Since this question only asks whether the last index can be reached, a greedy algorithm suffices. At every step the greedy choice is to record the farthest index reachable so far (that is, to assume the maximum jump), and at the end we check whether that farthest index reaches the last index.
If instead we needed the jump size to take at each index, or needed to minimize the number of jumps, this direct greedy approach would not serve our purpose.
Q2. How is the above solution a greedy algorithm?
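The if condition in the above code, if(i+A[i]>=last)last=i;, is what makes the algorithm greedy: scanning from right to left, whenever index i can jump to the current target last (i+A[i]>=last), we immediately make i the new target, without considering any other way of reaching the end.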
The analysis provided here may help you.
Edit
Let's talk about the input you mentioned - [3,3,1,0,4].
When i=0, algorithm checks what is the maximum index that we can reach from i=0.
Then we move to the next index and check the maximum index we can reach from i=1. Since we moved to i=1, it is guaranteed that we can get to index 1 from index 0 (the jump size doesn't matter).
Please note, in this problem, we don't care whether we should take a jump of size 3 at i=0 though we know this will not help us to reach the end. What we care about is whether we can reach the end or beyond that end index by taking jumps.
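To make the "farthest reachable index" idea concrete, here is a minimal sketch of a forward scan. It is not the quoted solution (which scans backwards); the names canReachEnd and reach are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Forward greedy scan: `reach` is the farthest index known to be reachable so far.
// Assumes every entry of `jumps` is non-negative, as the problem states.
bool canReachEnd(const std::vector<int>& jumps) {
    std::size_t reach = 0;
    for (std::size_t i = 0; i < jumps.size(); ++i) {
        if (i > reach) {
            return false;  // index i is unreachable, so everything after it is too
        }
        reach = std::max(reach, i + static_cast<std::size_t>(jumps[i]));
    }
    return true;  // the loop got past every index, including the last one
}
```

For [3,3,1,0,4] the scan never commits to the jump of size 3 from index 0; it only notes that indices up to 3 are reachable, and index 1 then extends the reach to 4.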
I have list of seed strings, about 100 predefined strings. All strings contain only ASCII characters.
std::list<std::wstring> seeds{ L"google", L"yahoo", L"stackoverflow"};
My app constantly receives a lot of strings, which can contain any characters. I need to check each received line and decide whether it contains any of the seeds. The comparison must be case insensitive.
I need the fastest possible algorithm to test received string.
Right now my app uses this algo:
std::wstring testedStr;
for (auto & seed : seeds)
{
    if (boost::icontains(testedStr, seed))
    {
        return true;
    }
}
return false;
It works well, but I'm not sure that this is the most efficient way.
How is it possible to implement the algorithm in order to achieve better performance?
This is a Windows app. App receives valid std::wstring strings.
Update
For this task I implemented Aho-Corasick algo. If someone could review my code it would be great - I do not have big experience with such algorithms. Link to implementation: gist.github.com
If there is a finite set of strings to match, you can construct a tree such that, read from root to leaves, strings with a common prefix share the same branches.
This is known as a trie (prefix tree); a radix tree is a compressed variant of it.
For example, we might have the strings cat, coach, con, conch as well as dark, dad, dank, do. Their trie might look like this:
A search for one of the words in the tree will search the tree, starting from a root. Making it to a leaf would correspond to a match to a seed. Regardless, each character in the string should match to one of their children. If it does not, you can terminate the search (e.g. you would not consider any words starting with "g" or any words beginning with "cu").
There are various algorithms for constructing the tree as well as searching it as well as modifying it on the fly, but I thought I would give a conceptual overview of the solution instead of a specific one since I don't know of the best algorithm for it.
Conceptually, an algorithm you might use to search the tree would be related to the idea behind radix sort of a fixed amount of categories or values that a character in a string might take on at a given point in time.
This lets you check one word against your word-list. Since you're looking for this word-list as sub-strings of your input string, there's going to be more to it than this.
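As a rough illustration of the whole-word lookup described above, here is a minimal trie sketch. The names TrieNode, trieInsert and trieContains are hypothetical, and it only answers "is this exact word in the list", not the substring question.

```cpp
#include <map>
#include <memory>
#include <string>

// One node per character; `isWord` marks the end of a stored word.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isWord = false;
};

// Walks down from the root, creating missing branches along the way.
void trieInsert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c];
        if (!child) {
            child = std::make_unique<TrieNode>();
        }
        node = child.get();
    }
    node->isWord = true;
}

// Follows one branch per character and stops at the first missing branch.
bool trieContains(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (char c : word) {
        auto it = node->children.find(c);
        if (it == node->children.end()) {
            return false;  // e.g. "cu" fails here when only "cat", "coach", "con" are stored
        }
        node = it->second.get();
    }
    return node->isWord;
}
```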
Edit: As other answers have mentioned, the Aho-Corasick algorithm is a sophisticated string-matching algorithm consisting of a trie with additional links for taking "shortcuts" through the tree, along with a different search pattern to accompany this. (As an interesting note, Alfred Aho is also a co-author of the popular compiler textbook Compilers: Principles, Techniques, and Tools, as well as the algorithms textbook The Design and Analysis of Computer Algorithms. He is also a former member of Bell Labs. Margaret J. Corasick does not seem to have much public information about herself.)
You can use the Aho–Corasick algorithm.
It builds a trie/automaton in which some vertices are marked as terminal; reaching a terminal vertex means the string contains a seed.
It is built in O(sum of dictionary word lengths) and gives the answer in O(test string length).
Advantages:
It works specifically with several dictionary words, and the check time doesn't depend on the number of words (ignoring cases where the automaton doesn't fit in memory, etc.).
The algorithm is not hard to implement (compared to suffix structures, at least).
You can make it case insensitive by lowercasing each symbol if it's ASCII (non-ASCII chars don't match anyway).
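A compact sketch of such an automaton might look like the following. SeedMatcher and its members are hypothetical names, the alphabet is assumed to be 7-bit ASCII (non-ASCII input characters simply reset the state, and non-ASCII seeds are skipped), and it only reports whether any seed occurs, not which one.

```cpp
#include <array>
#include <queue>
#include <string>
#include <vector>

// Hypothetical helper: Aho-Corasick automaton over ASCII with A-Z folded to a-z.
class SeedMatcher {
    static constexpr int kAlphabet = 128;

    struct Node {
        std::array<int, kAlphabet> next;
        int fail = 0;
        bool terminal = false;
        Node() { next.fill(-1); }
    };

    std::vector<Node> nodes_;

    // Maps a character to 0..127 (lowercased), or -1 if it is not ASCII.
    static int symbol(wchar_t ch) {
        unsigned long code = static_cast<unsigned long>(ch);
        if (code >= static_cast<unsigned long>(kAlphabet)) return -1;
        if (code >= 'A' && code <= 'Z') code += 'a' - 'A';
        return static_cast<int>(code);
    }

    // Breadth-first pass: sets failure links and fills in missing transitions.
    void buildLinks() {
        std::queue<int> pending;
        for (int c = 0; c < kAlphabet; ++c) {
            int child = nodes_[0].next[c];
            if (child < 0) {
                nodes_[0].next[c] = 0;
            } else {
                nodes_[child].fail = 0;
                pending.push(child);
            }
        }
        while (!pending.empty()) {
            int u = pending.front();
            pending.pop();
            // A state is terminal if any suffix of it (via the failure link) is a seed.
            if (nodes_[nodes_[u].fail].terminal) nodes_[u].terminal = true;
            for (int c = 0; c < kAlphabet; ++c) {
                int child = nodes_[u].next[c];
                int viaFail = nodes_[nodes_[u].fail].next[c];
                if (child < 0) {
                    nodes_[u].next[c] = viaFail;
                } else {
                    nodes_[child].fail = viaFail;
                    pending.push(child);
                }
            }
        }
    }

public:
    explicit SeedMatcher(const std::vector<std::wstring>& seeds) {
        nodes_.emplace_back();  // root
        for (const auto& seed : seeds) {
            int cur = 0;
            for (wchar_t ch : seed) {
                int c = symbol(ch);
                if (c < 0) { cur = -1; break; }  // sketch assumption: skip non-ASCII seeds
                if (nodes_[cur].next[c] < 0) {
                    nodes_[cur].next[c] = static_cast<int>(nodes_.size());
                    nodes_.emplace_back();
                }
                cur = nodes_[cur].next[c];
            }
            if (cur > 0) nodes_[cur].terminal = true;
        }
        buildLinks();
    }

    // One table lookup per input character, independent of the number of seeds.
    bool containsAny(const std::wstring& text) const {
        int state = 0;
        for (wchar_t ch : text) {
            int c = symbol(ch);
            state = (c < 0) ? 0 : nodes_[state].next[c];
            if (nodes_[state].terminal) return true;
        }
        return false;
    }
};
```

The automaton would be built once from the seed list and then reused for every received string. The dense 128-entry transition table trades memory for speed; with about 100 short seeds that seems a reasonable trade, but it is a design choice rather than a requirement of the algorithm.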
You should try a pre-existing regex utility, it may be slower than your hand-rolled algorithm but regex is about matching multiple possibilities, so it is likely it will be already several times faster than a hashmap or a simple comparison to all strings. I believe regex implementations may already use the Aho–Corasick algorithm mentioned by RiaD, so basically you will have at your disposal a well tested and fast implementation.
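If you have C++11, you already have a standard regex library:
#include <regex>
#include <string>
// Returns true if input contains any of the seeds, ignoring case.
bool containsSeed(const std::wstring& input) {
    // Compile the alternation once; icase makes the match case insensitive.
    static const std::wregex seedRegex(L"google|yahoo|stackoverflow", std::regex_constants::icase);
    // regex_search looks for a match anywhere in input; regex_match would require the whole string to match.
    return std::regex_search(input, seedRegex);
}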
I expect this to generate the best possible minimum match tree, so I expect it to be really fast (and reliable!)
One of the faster ways is to use a suffix tree (https://en.wikipedia.org/wiki/Suffix_tree), but this approach has a huge disadvantage: it is a complicated data structure that is difficult to construct. Ukkonen's algorithm builds the tree from a string in linear time: https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm
Edit: As Matthieu M. pointed out, the OP asked if a string contains a keyword. My answer only works when the string equals the keyword or if you can split the string e.g. by the space character.
Especially with a large number of possible candidates that are known at compile time, a perfect hash function generated with a tool like gperf is worth a try. The main principle is that you feed the generator your seeds and it generates a function containing a hash function that has no collisions among the seed values. At runtime you give the function a string; it calculates the hash and then checks whether the string equals the only possible candidate corresponding to that hash value.
The runtime cost is hashing the string and then comparing against the only possible candidate (O(1) in the number of seeds; the hash and the comparison are at most linear in the string length).
To make the comparison case insensitive, apply tolower to the seeds and to your string.
Because the number of strings is not large (~100), you can use the following algorithm:
Calculate the maximum length of the words you have. Let it be N.
Create an array of checksums, int checks[N];.
Let the checksum be the sum of all characters in the search phrase. You can calculate this checksum for each word in your list (which is known at compile time) and create a std::map<int, std::vector<std::wstring>>, where the int is the checksum of a string and the vector contains all your strings with that checksum.
Create an array of such maps, one for each length (up to N); this can also be done at compile time.
Now move over the big string with a pointer. When the pointer points to character X, add the value of X to every integer in checks, and from the K-th integer (K from 1 to N) subtract the value of the character K positions before X. This way the checks array always holds the correct checksum for every length.
After that, look up in the map whether any strings exist with that (length, checksum) pair, and if so, compare them.
False positives (the checksum and length are equal but the phrase is not) should be very rare.
So, let's say R is the length of the big string. Then looping over it takes O(R).
At each step you perform N additions and N subtractions of a small number (a char value), which is very fast. At each step you also have to look up a counter in the checks array, which is O(1) because it is one memory block.
At each step you also have to find the map in the array of maps, which is also O(1) for the same reason.
Inside the map you have to search for the strings with the correct checksum in O(log F), where F is the size of the map; it will usually contain no more than 2-3 strings, so in general we can treat this as O(1) too.
You can also check whether any strings share a checksum; if none do (which is likely with just 100 words), you can discard the map altogether and store pairs instead.
So, in the end, that gives O(R) for a fixed N, with a fairly small constant factor.
This way of calculating the checksum can be changed, but it is simple and fast, with very rare false positives.
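A simplified sketch of the rolling-checksum idea follows. It departs from the description above in two ways that should be read as assumptions: it makes one pass over the text per distinct seed length instead of updating all lengths in a single pass, and it rebuilds the lookup table on every call for brevity. The names lowerAscii and containsSeedByChecksum are made up.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Lowercases A-Z only; every other character is left untouched.
wchar_t lowerAscii(wchar_t ch) {
    return (ch >= L'A' && ch <= L'Z') ? static_cast<wchar_t>(ch - L'A' + L'a') : ch;
}

bool containsSeedByChecksum(std::wstring text, const std::vector<std::wstring>& seeds) {
    for (auto& ch : text) ch = lowerAscii(ch);

    // length -> (checksum -> lowercased seeds with that length and checksum)
    std::map<std::size_t, std::map<unsigned long, std::vector<std::wstring>>> table;
    for (const auto& seed : seeds) {
        if (seed.empty()) continue;
        std::wstring lowered = seed;
        unsigned long sum = 0;
        for (auto& ch : lowered) {
            ch = lowerAscii(ch);
            sum += static_cast<unsigned long>(ch);
        }
        table[lowered.size()][sum].push_back(lowered);
    }

    for (const auto& byLength : table) {
        const std::size_t len = byLength.first;
        if (len > text.size()) continue;
        unsigned long window = 0;  // sum of the last `len` characters
        for (std::size_t i = 0; i < text.size(); ++i) {
            window += static_cast<unsigned long>(text[i]);
            if (i >= len) window -= static_cast<unsigned long>(text[i - len]);
            if (i + 1 < len) continue;  // window not full yet
            auto hit = byLength.second.find(window);
            if (hit == byLength.second.end()) continue;
            // Checksums can collide, so confirm with a real comparison.
            for (const auto& seed : hit->second) {
                if (text.compare(i + 1 - len, len, seed) == 0) return true;
            }
        }
    }
    return false;
}
```

In a real application the table would be built once up front; the final comparison is what rules out false positives such as "elgoog", which has the same length and checksum as "google".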
As a variant on DarioOO’s answer, you could get a possibly faster implementation of a regular expression match, by coding a lex parser for your strings. Though normally used together with yacc, this is a case where lex on its own does the job, and lex parsers are usually very efficient.
This approach might fall down if all your strings are long, as then an algorithm such as Aho-Corasick, Commentz-Walter or Rabin-Karp would probably offer significant improvements, and I doubt that lex implementations use any such algorithm.
This approach is harder if you have to be able to configure the strings without reconfiguration, but since flex is open source you could cannibalise its code.
This answer determines if two strings are permutations by comparing their contents. If they contain the same number of each character, they are obviously permutations. This is accomplished in O(N) time.
I don't like the answer though because it reinvents what is_permutation is designed to do. That said, is_permutation has a complexity of:
At most O(N²) applications of the predicate, or exactly N if the sequences are already equal, where N=std::distance(first1, last1)
So I cannot advocate the use of is_permutation where it is orders of magnitude slower than a hand-spun algorithm. But surely the implementer of the standard would not miss such an obvious improvement? So why is is_permutation O(N²)?
is_permutation works on almost any data type. The algorithm in your link works for data types with a small number of values only.
It's the same reason why std::sort is O(N log N) but counting sort is O(N).
It was I who wrote that answer.
When the string's value_type is char, the number of elements required in a lookup table is 256. For a two-byte encoding, 65536. For a four-byte encoding, the lookup table would have just over 4 billion entries, at a likely size of 16 GB! And most of it would be unused.
So the first thing is to recognize that even if we restrict the types to char and wchar_t, it may still be untenable. Likewise if we want to do is_permutation on sequences of type int.
We could have a specialization of std::is_permutation<> for integral types of size 1 or 2 bytes. But this is somewhat reminiscent of std::vector<bool> which not everyone thinks was a good idea in retrospect.
We could also use a lookup table based on std::map<T, size_t>, but this is likely to be allocation-heavy so it might not be a performance win (or at least, not always). It might be worth implementing one for a detailed comparison though.
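As a rough sketch of what such a map-based version could look like (not a proposal for the standard library): the name isPermutationByCount is hypothetical, and note that it requires operator< on T, which std::is_permutation does not.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Counts occurrences in a std::map, so it costs O(N log N) comparisons
// plus one allocation per distinct value.
template <typename T>
bool isPermutationByCount(const std::vector<T>& a, const std::vector<T>& b) {
    if (a.size() != b.size()) return false;
    std::map<T, std::ptrdiff_t> balance;
    for (const T& x : a) ++balance[x];
    for (const T& x : b) {
        if (--balance[x] < 0) return false;  // b has more copies of x than a
    }
    // Equal sizes and no negative balance imply every balance is zero.
    return true;
}
```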
In summary, I don't fault the C++ standard for not including a high-performance version of is_permutation for char. First because in the real world I'm not sure it's the most common use of the template, and second because the STL is not the be-all and end-all of algorithms, especially where domain knowledge can be used to accelerate computation for special cases.
If it turns out that is_permutation for char is quite common in the wild, C++ library implementors would be within their rights to provide a specialization for it.
The answer you cite works on chars. It assumes they are 8 bit (not necessarily the case) and so there are only 256 possibilities for each value, and that you can cheaply go from each value to a numeric index to use for a lookup table of counts (for char in this case, the value and the index are the same thing!)
It generates a count of how many times each char value occurs in each string; then, if these distributions are the same for both strings, the strings are permutations of each other.
What is the time complexity?
you have to walk each character of each string, so M+N steps for two inputs of lengths M and N
each of these steps involves incrementing a count in a fixed size table at an index given by the char, so is constant time
So the overall time complexity is O(N+M): linear, as you describe.
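The counting approach described above can be sketched as follows. The function name isPermutationOfChars is made up, and the table is sized with UCHAR_MAX so it does not depend on char being 8 bits.

```cpp
#include <array>
#include <climits>
#include <cstddef>
#include <string>

// Linear-time check that relies on char values being usable as table indices.
bool isPermutationOfChars(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    std::array<std::ptrdiff_t, UCHAR_MAX + 1> counts{};  // zero-initialized
    for (unsigned char c : a) ++counts[c];  // one constant-time step per character of a
    for (unsigned char c : b) --counts[c];  // and one per character of b
    for (std::ptrdiff_t n : counts) {
        if (n != 0) return false;  // some character occurs a different number of times
    }
    return true;
}
```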
Now, std::is_permutation makes no such assumptions about its input. It doesn't know that there are only 256 possibilities, or indeed that they are bounded at all. It doesn't know how to go from an input value to a number it can use as an index, never mind how to do that in constant time. The only thing it knows is how to compare two values for equality, because the caller supplies that information.
So, the time complexity:
we know it has to consider each element of each input at some point
we know that, for each element it hasn't seen before (I'll leave discussion of how that's determined and why that doesn't impact the big O complexity as an exercise), it's not able to turn the element into any kind of index or key for a table of counts, so it has no way of counting how many occurrences of that element exist which is better than a linear walk through both inputs to see how many elements match
so the complexity is going to be quadratic at best.
Today my professor gave us 2 take-home questions as practice for the upcoming array unit in C, and I am wondering which sorting algorithm these 2 problems resemble and what their Big O is. Now, I am not coming here just expecting answers; I have ALREADY solved them, but I am not confident in my answers, so I will post them under each question. If I am wrong, please correct me and explain my error in thinking.
Question 1:
If we decide to go through an array's(box) element(folders) one at a time. Starting at the first element and comparing it with the next. Then if they are the same the comparison ends, however if both are not equal then it moves on to comparing the next two ELEMENTS [2] and [3]. This process is repeated and will stop once last two elements are compared and note that the array IS already sorted by last name and we are looking for same first name! Example: [ Harper Steven, Hawking John, Ingleton Steven]
My believed answer:
I believe it is O(n) because it's just going over the elements of an array, comparing array[0] to array[1] and then array[2] to array[3], etc. This process is linear and continues until the last two are compared. Definitely not log n because we aren't multiplying or dividing by 2.
Final Question:
Suppose we have a box of folders each containing info on one person. If we were to want to look for people with same first name, we could first start by placing a sticker on the first folder in the box and then going through the folders after it in an orderly fashion until we find person with same name. If we find a folder with same name, we move that folder next to the folder with a sticker. Once we find ONE case where two people have same name, we stop and go to sleep because we're lazy. If the first search fails however, we simply remove sticker and place it on next folder and then continue as we did earlier. We repeat this process until sticker is on last folder in a scenario where we have no two people with same name.
This array is NOT sorted and compares the first folder with sticker folder[0] with the next i folder[i] elements.
My answer:
I feel like this can't be O(n), but maybe O(n^2), where it kinda feels like we have an array and then we keep repeating the process, so the work is proportional to the square of the input (folders). I could be wrong here though >.>
You're right on both questions… but it would help to explain things a bit more rigorously. I don't know what the standards of your class are; you probably don't need an actual proof, but showing more detailed reasoning than "we aren't multiplying or dividing by two" never hurts. So…
In the first question, there's clearly nothing happening here but comparisons, so that's what we have to count.
And the worst case is obviously that you have to go through the whole array.
So, in that case, you have to compare a[0] == a[1], then a[1] == a[2], …, a[N-1] == a[N]. For each of N-1 elements, there's 1 comparison. That's N-1 steps, which is obviously O(N).
The fact that the array is sorted turns out to be irrelevant here. (Of course since they're not sorted by your search key—that is, they're sorted by last name, but you're comparing by first name—that was already pretty obvious.)
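To see the linear count directly, here is a small sketch that counts the comparisons of the neighbour-by-neighbour scan. The function name scanAdjacent and the use of plain first-name strings are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Compares each element with its right-hand neighbour and returns how many
// comparisons were made; `found` reports whether two neighbours matched.
std::size_t scanAdjacent(const std::vector<std::string>& firstNames, bool& found) {
    std::size_t comparisons = 0;
    found = false;
    for (std::size_t i = 0; i + 1 < firstNames.size(); ++i) {
        ++comparisons;
        if (firstNames[i] == firstNames[i + 1]) {
            found = true;
            break;
        }
    }
    return comparisons;
}
```

With the first names from the example (Steven, John, Steven) the scan makes 2 comparisons and finds nothing, because the two Stevens are not neighbours; an array of length n never needs more than n-1 comparisons.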
In the second question, there are two things happening here: comparisons, and then moves.
For the comparisons, the worst case is that you have to do all N searches because there are no matches. As you say, we start with a[0] vs. a[1], …, a[N]; then a[1] vs. a[2], …, a[N], etc. So, N-1 comparisons, then N-2, and so on down to 0. So the total number of comparisons is sum(0…N-1), which is N*(N-1)/2, or N^2/2 - N/2, which is O(N^2).
For the moves, the worst case is that you find a match between a[0] and a[N]. In that case, you have to swap a[N] with a[N-1], then a[N-1] with a[N-2], and so on until you've swapped a[2] with a[1]. So, that's N-1 swaps, which is O(N), which you can ignore because you've already got an O(N^2) term.
As a side note, I'm not sure from your description whether you're talking about an array from a[0…N], or an array of length N, so a[0…N-1], so there could be an off-by-one error in both of the above. But it should be pretty easy to prove to yourself that it doesn't make a difference.
Scenario 2, a method of finding two matching items of arbitrary value, is indeed “quadratic”. Each pass looking for a match of one candidate against all the rest of the elements is O(n). But you repeat that n times. The value of n drops as you go so a detailed number of comparisons would be closer to n+(n-1)+(n-2)+ … 1 which is (n+1)×(n/2) or ½(n²+n) but all we care about is the overall shape of the curve so don't worry about the lower order terms or the coefficients. It's O(n²).
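The quadratic count can be checked the same way. This sketch (with the hypothetical name countPairComparisons) models only the comparisons of the sticker procedure, not the moving of folders.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Index i plays the role of the stickered folder; j walks over the folders after it.
// Returns the number of comparisons made before stopping.
std::size_t countPairComparisons(const std::vector<std::string>& firstNames, bool& found) {
    std::size_t comparisons = 0;
    found = false;
    for (std::size_t i = 0; i < firstNames.size(); ++i) {
        for (std::size_t j = i + 1; j < firstNames.size(); ++j) {
            ++comparisons;
            if (firstNames[i] == firstNames[j]) {
                found = true;
                return comparisons;  // stop at the first match
            }
        }
    }
    return comparisons;
}
```

For five distinct names this makes 4 + 3 + 2 + 1 = 10 comparisons, which matches n(n-1)/2 for n = 5.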
What would be the best algorithm for finding a number that occurs only once in a list in which all other numbers occur exactly twice?
So, in the list of integers (let's take it as an array), each integer repeats exactly twice, except one. What is the best algorithm to find that one?
The fastest (O(n)) and most memory efficient (O(1)) way is with the XOR operation.
In C:
int arr[] = {3, 2, 5, 2, 1, 5, 3};
int num = 0, i;
for (i=0; i < 7; i++)
num ^= arr[i];
printf("%i\n", num);
This prints "1", which is the only one that occurs once.
This works because the first time you hit a number it marks the num variable with itself, and the second time it unmarks num with itself (more or less). The only one that remains unmarked is your non-duplicate.
By the way, you can expand on this idea to very quickly find two unique numbers among a list of duplicates.
Let's call the unique numbers a and b. First take the XOR of everything, as Kyle suggested. What we get is a^b. We know a^b != 0, since a != b. Choose any 1 bit of a^b, and use that as a mask -- in more detail: choose x as a power of 2 so that x & (a^b) is nonzero.
Now split the list into two sublists -- one sublist contains all numbers y with y&x == 0, and the rest go in the other sublist. By the way we chose x, we know that a and b are in different buckets. We also know that each pair of duplicates is still in the same bucket. So we can now apply ye olde "XOR-em-all" trick to each bucket independently, and discover what a and b are completely.
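A minimal sketch of this two-bucket trick, assuming the input really contains exactly two values that occur once and that everything else occurs exactly twice; findTwoUnique is a made-up name, and the lowest set bit of a^b is used as the mask x.

```cpp
#include <utility>
#include <vector>

// Returns the two values that occur once. The first element of the pair is
// the one that has the mask bit set.
std::pair<unsigned, unsigned> findTwoUnique(const std::vector<unsigned>& values) {
    unsigned all = 0;
    for (unsigned v : values) all ^= v;  // all == a ^ b

    unsigned mask = all & (~all + 1u);   // lowest set bit of a ^ b

    unsigned first = 0;
    for (unsigned v : values) {
        if (v & mask) first ^= v;        // XOR of the bucket with the mask bit set
    }
    return {first, all ^ first};         // the other bucket's XOR follows for free
}
```

Rather than building two sublists, the sketch XORs only one bucket and derives the other value from a^b, which keeps the memory use at O(1).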
Bam.
O(N) time, O(N) memory
HT = Hash Table
HT.clear()
go over the list in order
for each item you see
    if (HT.Contains(item)) -> HT.Remove(item)
    else -> HT.Add(item)
at the end, the item in the HT is the item you are looking for.
Note (credit #Jared Updike): This system will find all items that occur an odd number of times.
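The pseudocode above could be sketched in C++ with std::unordered_set; the function name oddOccurrences is hypothetical, and the O(N) time bound is the expected case for a hash set, not a worst-case guarantee.

```cpp
#include <unordered_set>
#include <vector>

// Toggles membership for every item: items seen an even number of times
// cancel out, items seen an odd number of times remain in the set.
std::unordered_set<int> oddOccurrences(const std::vector<int>& items) {
    std::unordered_set<int> seen;
    for (int item : items) {
        auto inserted = seen.insert(item);
        if (!inserted.second) {
            seen.erase(inserted.first);  // already present: remove it again
        }
    }
    return seen;
}
```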
comment: I don't see how people can vote up solutions that give you NLogN performance. In which universe is that "better"?
I am even more shocked that you marked an NLogN solution as the accepted answer...
I do agree, however, that if memory is required to be constant, then NLogN would be (so far) the best solution.
Kyle's solution would obviously not catch situations where the data set does not follow the rules. If all numbers were in pairs, the algorithm would give a result of zero, the exact same value as if zero were the only value with a single occurrence.
If there were multiple single-occurrence values or triples, the result would be erroneous as well.
Testing the data set might well end up requiring a more costly algorithm, either in memory or in time.
Csmba's solution does reveal some erroneous data (no single-occurrence value, or more than one), but not others (quadruples). Regarding his solution, depending on the implementation of the HT, memory and/or time is more than O(n).
If we cannot be sure about the correctness of the input set, either sorting and counting or using a hash table that counts occurrences, with the integer itself as the hash key, would be feasible.
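One way the counting hash table could look, as a sketch: findSingleChecked is a made-up name, and the choice to reject any input that breaks the "one single, everything else twice" rule is an assumption about what validation is wanted.

```cpp
#include <unordered_map>
#include <vector>

// Counts every value, then validates the whole data set. Returns true only if
// exactly one value occurs once and all others occur exactly twice.
// `out` is only meaningful when the function returns true.
bool findSingleChecked(const std::vector<int>& values, int& out) {
    std::unordered_map<int, int> counts;
    for (int v : values) ++counts[v];

    int singles = 0;
    for (const auto& entry : counts) {
        if (entry.second == 1) {
            out = entry.first;
            ++singles;
        } else if (entry.second != 2) {
            return false;  // triples, quadruples, ... break the rules
        }
    }
    return singles == 1;
}
```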
I would say that using a sorting algorithm and then going through the sorted list to find the number is a good way to do it.
And now the problem is finding "the best" sorting algorithm. There are a lot of sorting algorithms, each of them with its strong and weak points, so this is quite a complicated question. The Wikipedia entry seems like a nice source of info on that.
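Whichever sort is used, the scan afterwards is short. Here is a sketch using std::sort, assuming the input follows the rules (every value twice except one); findSingleBySorting is a hypothetical name, and the function works on a copy so the caller's list is left unchanged.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sorts a copy, then walks it in steps of two: the first position whose
// partner is missing or different holds the single value.
bool findSingleBySorting(std::vector<int> values, int& out) {
    std::sort(values.begin(), values.end());
    for (std::size_t i = 0; i < values.size(); i += 2) {
        if (i + 1 == values.size() || values[i] != values[i + 1]) {
            out = values[i];
            return true;
        }
    }
    return false;  // every value had a partner
}
```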
Implementation in Ruby:
a = [1,2,3,4,123,1,2,.........]
t = a.length-1
for i in 0..t
  s = a.index(a[i])+1
  b = a[s..t]
  w = b.include?a[i]
  if w == false
    puts a[i]
  end
end
You need to specify what you mean by "best" - to some, speed is all that matters and would qualify an answer as "best" - for others, they might forgive a few hundred milliseconds if the solution was more readable.
"Best" is subjective unless you are more specific.
That said:
Iterate through the numbers; for each number, search the list for that number, and when you reach the number whose search returns only 1 result, you are done.
Seems like the best you could do is to iterate through the list and, for every item, add it to a list of "seen" items or else remove it from "seen" if it's already there; at the end your list of "seen" items will contain the singular element. This is O(n) in time and O(n) in space in the worst case (it will use much less space if the list is sorted).
The fact that they're integers doesn't really factor in, since there's nothing special you can do with adding them up... is there?
Question
I don't understand why the selected answer is "best" by any standard. O(N*lgN) > O(N), and it changes the list (or else creates a copy of it, which is still more expensive in space and time). Am I missing something?
Depends on how large/small/diverse the numbers are though. A radix sort might be applicable which would reduce the sorting time of the O(N log N) solution by a large degree.
The sorting method and the XOR method have the same time complexity. The XOR method is only O(n) if you assume that bitwise XOR of two strings is a constant time operation. This is equivalent to saying that the size of the integers in the array is bounded by a constant. In that case you can use Radix sort to sort the array in O(n).
If the numbers are not bounded, then bitwise XOR takes time O(k) where k is the length of the bit string, and the XOR method takes O(nk). Now again Radix sort will sort the array in time O(nk).
You could simply put the elements in the set into a hash until you find a collision. In ruby, this is a one-liner.
def find_dupe(array)
  h={}
  array.detect { |e| h[e]||(h[e]=true; false) }
end
So, find_dupe([1,2,3,4,5,1]) would return 1.
This is actually a common "trick" interview question though. It is normally about a list of consecutive integers with one duplicate. In this case the interviewer is often looking for you to use the Gaussian sum of n-integers trick e.g. n*(n+1)/2 subtracted from the actual sum. The textbook answer is something like this.
def find_dupe_for_consecutive_integers(array)
  n=array.size-1 # subtract one from array.size because of the dupe
  array.sum - n*(n+1)/2
end