I ran into these questions in a Google search... they look pretty common, but I couldn't find a decent answer. Any tips/links?
1.Remove duplicates in array in O(n) without extra array
2.Write a program whose printed output is an exact copy of the source. Needless to say, merely echoing the actual source file is not allowed.
(1) isn't possible unless the array is presorted. The basic answer is to keep two pointers into the array: one walking forward searching for unequal elements, and one trailing. When the forward pointer encounters an unequal element, it copies it to the position after the trailing pointer and increments the trailing pointer.
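A minimal sketch of that two-pointer pass, assuming the array is already sorted:

#include <cstddef>

// Compact a sorted array in place; returns the number of unique elements.
std::size_t dedup_sorted(int* a, std::size_t n)
{
    if (n == 0) return 0;
    std::size_t trail = 0;                 // index of the last unique element written
    for (std::size_t lead = 1; lead < n; ++lead) {
        if (a[lead] != a[trail])
            a[++trail] = a[lead];          // copy the unequal element forward
    }
    return trail + 1;
}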
(2) I don't have one handy. This sounds like a pretty terrible interview question. In most interpreted languages, a zero-byte (empty) source file is valid input and prints out nothing... that should count.
For (1), you probably need more constraints than you've given. However, look up radix sort.
For (2), look up quine.
For your second question google for quine, you will find lots of answers!
The closest you can get is to use a hashtable to store the elements seen so far, assigning each non-duplicated element to the next free slot at the start of the array (this leaves several irrelevant entries at the end). That takes O(n) time, but it is not the sort of thing you want to have to write during a job interview. Alternatively, as long as the list is sorted, just check whether each element is equal to the previous one.
For 2, would just manually printing the contents of the file be allowed? (If so, the question is more than a little bit pointless.)
Edit:
Here is a fast Perl version of my solution to the first; in C++ you would have to manage the "seen" hash yourself:
# Return an unsorted version of an array without duplicates
sub unsortedDedup {
    my %seen;
    my @return;
    for (@_) {
        push @return, $_ unless $seen{$_}++;   # keep only the first occurrence of each value
    }
    return @return;
}
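For comparison, here's a rough C++ sketch of the same idea, using std::set as the "seen" table (this makes it O(n log n); a genuine hash table would give expected O(n)):

#include <set>
#include <vector>

// Return a copy of v without duplicates, preserving first occurrences.
std::vector<int> unsortedDedup(const std::vector<int>& v)
{
    std::set<int> seen;
    std::vector<int> result;
    for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it) {
        if (seen.insert(*it).second)    // insert() reports whether *it was new
            result.push_back(*it);
    }
    return result;
}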
The STL is often not an option in such interview questions, but here's one way to do #1 using the STL, although it does incur an additional sort (as explained by Terry's answer):
#include <iostream>
#include <algorithm>
#include <iterator>
int main()
{
int a[] = { 2, 2, 3, 2, 1, 4, 1, 3, 4, 1 };
int * end = a + sizeof(a) / sizeof(a[0]);
std::sort(a, end); // O(n log n)
end = std::unique(a, end); // O(n)
std::copy(a, end, std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
}
Here's the result:
$ ./a.out
1 2 3 4
std::unique() is generally implemented using the same technique Terry described in his answer (see bits/stl_algo.h in g++'s STL implementation for an example of how to implement it).
For #2, there are a number of answers for different languages here: http://www.nyx.net/~gthompso/quine.htm
There is also an alternative quine in C++ here: http://npcomplete.weebly.com/1/post/2010/02/self-reproducing-c-program-quine.html
I am trying to solve the programming problem firstDuplicate on codesignal. The problem is "Given an array a that contains only numbers in the range 1 to a.length, find the first duplicate number for which the second occurrence has minimal index".
Example: For a = [2, 1, 3, 5, 3, 2] the output should be firstDuplicate(a) = 3
There are 2 duplicates: numbers 2 and 3. The second occurrence of 3 has a smaller index than the second occurrence of 2 does, so the answer is 3.
With this code I pass 21/23 tests, but then it tells me that the program exceeded the execution time limit on test 22. How would I go about making it faster so that it passes the remaining two tests?
#include <algorithm>
int firstDuplicate(vector<int> a) {
vector<int> seen;
for (size_t i = 0; i < a.size(); ++i){
if (std::find(seen.begin(), seen.end(), a[i]) != seen.end()){
return a[i];
}else{
seen.push_back(a[i]);
}
}
if (seen == a){
return -1;
}
}
Anytime you get asked a question about "find the duplicate", "find the missing element", or "find the thing that should be there", your first instinct should be to use a hash table. In C++, there are the unordered_map and unordered_set classes for exactly this kind of coding exercise. An unordered_set is effectively a map of keys to bools.
Also, pass your vector by reference, not by value. Passing by value incurs the overhead of copying the entire vector.
Also, that comparison seems costly and unnecessary at the end.
This is probably closer to what you want:
#include <unordered_set>
int firstDuplicate(const vector<int>& a) {
std::unordered_set<int> seen;
for (int i : a) {
auto result_pair = seen.insert(i);
bool duplicate = (result_pair.second == false);
if (duplicate) {
return (i);
}
}
return -1;
}
std::find has linear time complexity in the distance between the first and last elements of the container (or until the number is found), thus a worst case of O(N), so your algorithm is O(N^2).
Instead of storing your numbers in a vector and searching it every time, you should use an associative container such as std::map to record the numbers encountered, and return a number as soon as it turns out to be already present while iterating.
std::map<int, int> hash;
for (const auto &i : a) {
    if (hash[i])        // hash[i] default-constructs to 0 the first time a key is touched
        return i;
    else
        hash[i] = 1;
}
return -1;              // no duplicate found
Edit: std::unordered_map is even more efficient if the order of keys doesn't matter, since its insertion takes constant time in the average case, compared to logarithmic for std::map.
It's probably an unnecessary optimization, but I think I'd try to take slightly better advantage of the specification. A hash table is intended primarily for cases where you have a fairly sparse conversion from possible keys to actual keys--that is, only a small percentage of possible keys are ever used. For example, if your keys are strings of length up to 20 characters, the theoretical maximum number of keys is 256^20. With that many possible keys, it's clear no practical program is going to store any more than a minuscule percentage, so a hash table makes sense.
In this case, however, we're told that the input is: "an array a that contains only numbers in the range 1 to a.length". So, even if half the numbers are duplicates, we're using 50% of the possible keys.
Under the circumstances, instead of a hash table, even though it's often maligned, I'd use an std::vector<bool>, and expect to get considerably better performance in the vast majority of cases.
int firstDuplicate(std::vector<int> const &input) {
std::vector<bool> seen(input.size()+1);
for (auto i : input) {
if (seen[i])
return i;
seen[i] = true;
}
return -1;
}
The advantage here is fairly simple: at least in a typical case, std::vector<bool> uses a specialization that stores bools in only one bit apiece. This way we're storing only one bit for each number in the input, which increases storage density, so we can expect excellent use of the cache. In particular, as long as the number of bytes in the cache is at least a little more than 1/8th the number of elements in the input array, we can expect all of seen to be in the cache most of the time.
Now make no mistake: if you look around, you'll find quite a few articles pointing out that vector<bool> has problems--and for some cases, that's entirely true. There are places and times that vector<bool> should be avoided. But none of its limitations applies to the way we're using it here--and it really does give an advantage in storage density that can be quite useful, especially for cases like this one.
We could also write some custom code to implement a bitmap that would give still faster code than vector<bool>. But using vector<bool> is easy, and writing our own replacement that's more efficient is quite a bit of extra work...
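For what it's worth, here is a rough sketch of what such a hand-rolled bitmap might look like (an illustration of the idea, not a tuned implementation):

#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size bitmap over 64-bit words.
class Bitmap {
    std::vector<std::uint64_t> words_;
public:
    explicit Bitmap(std::size_t bits) : words_((bits + 63) / 64) {}
    bool test(std::size_t i) const { return (words_[i / 64] >> (i % 64)) & 1; }
    void set(std::size_t i)        { words_[i / 64] |= std::uint64_t(1) << (i % 64); }
};

int firstDuplicate(const std::vector<int>& input) {
    Bitmap seen(input.size() + 1);
    for (int i : input) {
        if (seen.test(i)) return i;
        seen.set(i);
    }
    return -1;
}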
As the title says, I have in my mind some methods to do it but I don't know which is fastest.
So let's say that we have a vector<int> vals with some values.
1
After my vals are added
sort(vals.begin(), vals.end());
auto last = unique(vals.begin(), vals.end());
vals.erase(last, vals.end());
2
Convert to set after my vals are added:
set<int> s( vals.begin(), vals.end() );
vals.assign( s.begin(), s.end() );
3
When I add my vals, I check whether each one is already in my vector:
if( find(vals.begin(), vals.end(), myVal) == vals.end() )
    // myVal is not there yet, so add it
4
Use a set from the start.
Ok, I've got these 4 methods; my questions are:
1. Of 1, 2 and 3, which is the fastest?
2. Is 4 faster than the first 3?
3. At 2, after converting the vector to a set, is it more convenient to use the set for what I need to do, or should I do the vals.assign( .. ) and continue with my vector?
Question 1: Both 1 and 2 are O(n log n), 3 is O(n^2). Between 1 and 2, it depends on the data.
Question 2: 4 is also O(n log n) and can be better than 1 and 2 if you have lots of duplicates, because it only stores one copy of each. Imagine a million values that are all equal.
Question 3: Well, that really depends on what you need to do.
The only thing that can be said without knowing more is that your alternative number 3 is asymptotically worse than the others.
If you're using C++11 and don't need ordering, you can use std::unordered_set, which is a hash table and can be significantly faster than std::set.
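For example, option 2 with an unordered_set in place of the set; a sketch assuming you don't care about the final order:

#include <unordered_set>

std::unordered_set<int> s( vals.begin(), vals.end() );
vals.assign( s.begin(), s.end() );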
Option 1 is going to beat all the others. The complexity is just O(N log N) and the contiguous memory of vector keeps the constant factors low.
std::set typically suffers a lot from non-contiguous allocations. It's not just slow to access them; even creating them takes significant time.
These methods all have their shortcomings, although (1) is worth looking at.
But take a look at this 5th option: bear in mind that you can access the vector's data buffer using the data() function. Then, since no reallocation will take place (the vector only ever gets smaller), apply the algorithm you learned at school:
vals.resize(unduplicate(vals.data(), vals.size()));

// Remove duplicates in place by overwriting each duplicate with the last
// not-yet-inspected element; returns the new logical length.
std::size_t unduplicate(int* arr, std::size_t length) /*Reference: Gang of Four, I think*/
{
    int *it, *end = arr + length - 1;
    for (it = arr + 1; arr < end; arr++, it = arr + 1) {
        while (it <= end) {
            if (*it == *arr) {
                *it = *end--;    // overwrite the duplicate with the last unchecked element
            } else {
                ++it;
            }
        }
    }
    return static_cast<std::size_t>(end - arr + 1);
}
Then resize the vector at the end, as above, if that is what's required. This is never worse than O(N^2), needs no extra memory, and avoids sorting entirely (note it doesn't preserve the order of the surviving elements).
Your 4th option might be an idea if you can adopt it. Profile the performance. Otherwise use my algorithm from the 1960s.
I ran into a similar problem recently, and experimented with 1, 2, and 4, as well as with an unordered_set version of 4. It turned out that the best performance came from the latter: 4 with unordered_set in place of set.
BTW, that empirical finding is not too surprising if you consider that both set and sort are a bit of overkill: they guarantee the relative order of unequal elements. For example, the input 4,3,5,2,4,3 would lead to the sorted output of unique values 2,3,4,5. This is unnecessary if you can live with unique values in arbitrary order, e.g. 3,4,2,5. unordered_set guarantees only uniqueness, not order, and therefore doesn't have to perform the additional work of ordering different elements.
I'm looking for a data structure (and a C++ implementation) that allows me to search efficiently for all elements having an integer value within a given interval. Example: say the set contains:
3,4,5,7,11,13,17,20,21
Now I want to know all elements from this set within [5,19], so the answer should be 5,7,11,13,17.
For my usage, trivial linear search is not an option, as the number of elements is large (several million) and I have to do the search quite often. Any suggestions?
For this, you typically use std::set, an ordered set which is commonly implemented as a balanced search tree (at least that's one possible implementation).
To get the elements in the queried interval, find the two iterators pointing at the first element in the interval and at one past the last. For a std::set, use the member functions s.lower_bound(x) and s.upper_bound(y) to treat both interval limits as inclusive: [x,y]. (If you want the end to be exclusive, use lower_bound for the end as well.)
These member functions have logarithmic complexity in the size of the set: O(log n). (Avoid the free algorithms std::lower_bound/std::upper_bound here: on set's non-random-access iterators they take a linear number of steps.)
Note that you may also use a std::vector if you keep it sorted, in which case the free algorithms are fine. This might be advantageous in some situations, but if you always want the elements sorted, use std::set, as it does that automatically for you.
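Here is a sketch of that sorted-vector variant; with random-access iterators the free algorithms really are O(log n):

#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v = { 3,4,5,7,11,13,17,20,21 };     // must be kept sorted
    auto lower = std::lower_bound(v.begin(), v.end(), 5);
    auto upper = std::upper_bound(v.begin(), v.end(), 19);
    for (auto it = lower; it != upper; ++it)
        std::cout << *it << '\n';                        // prints 5 7 11 13 17
}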
Here's a demo with std::set:
#include <set>
#include <algorithm>
#include <iostream>
int main()
{
// Your set (Note that these numbers don't have to be given in order):
std::set<int> s = { 3,4,5,7,11,13,17,20,21 };
// Your query:
int x = 5;
int y = 19;
// The iterators:
auto lower = s.lower_bound(x);
auto upper = s.upper_bound(y);
// Iterating over them:
for (auto it = lower; it != upper; ++it) {
// Do something with *it, or just print *it:
std::cout << *it << '\n';
}
}
Output:
5
7
11
13
17
For searching within intervals like the one you mentioned, segment trees work very well. In competitive programming, many questions are based on this data structure.
One such implementation could be found here:
http://www.sanfoundry.com/cpp-program-implement-segement-tree/
You might need to modify the code to suit your question, but the basic implementation remains the same.
I have a few questions related to portions of my code.
The first has to do with how I find the length of an array of arrays of strings. I'm using the following as a map in a calculus tool I'm writing.
std::string dMap[][10] = {{"x", "1"}, {"log(x)", "1/x"}, {"e^x", "e^x"}};
I'm wondering how to do the equivalent of
int arr[] = {1, 69, 2};
int arrlen = sizeof(arr)/sizeof(int);
with an array of elements of type std::string. Also, is there a better way of storing symbolic representations of (f(x), f'(x)) pairs? I'm trying to not use C++11.
My next question has to do with a procedure I wrote that isn't working. Here it is:
std::string CalculusWizard::composeFunction(const std::string & fx, const char & x, const std::string & gx)
{
/* Return fx compose gx, i.e. return a string that is fx with every instance of the character x replaced
by the equation gx.
E.g. fx="x^2", x="x", gx="sin(x)" ---> composeFunction(fx, x, gx) = "(sin(x))^2"
*/
std::string hx(""); // equation to return
std::string lastString("");
for (std::string::const_iterator it(fx.begin()), offend(fx.end()); it != offend; ++it)
{
if (*it == x)
{
hx += "(" + gx + ")";
lastString.erase(lastString.begin(), lastString.end());
}
else
{
lastString.push_back(*it);
}
}
return hx;
}
First of all, where's the bug in the procedure? It's not working when I test it out.
Second of all, when trying to make a string empty again, is it faster to do
lastString.erase(lastString.begin(), lastString.end());
or
lastString = "";
???
Thank you for your time.
Question 1) Understand that you can't, and really don't need to, calculate the size of a std::string this way. Just ask it how big it is and it will tell you.
// comparing size, length, capacity and max_size
#include <iostream>
#include <string>
int main ()
{
std::string str ("Test string");
std::cout << "size: " << str.size() << "\n";
std::cout << "length: " << str.length() << "\n";
std::cout << "capacity: " << str.capacity() << "\n";
std::cout << "max_size: " << str.max_size() << "\n";
return 0;
}
http://www.cplusplus.com/reference/string/string/capacity/
As for an array of strings, well go read this:
How to determine the size of an array of strings in C++?
Check out David Rodríguez's answer.
Question 2) The better way might be to define a FunctionPair class, depending on what you're doing with them. A std::vector<FunctionPair> might come in handy.
If FunctionPair doesn't end up with any behavior (functions) associated with it then a struct might be enough: std::pair<std::string, std::string> could also be shoved into a vector.
You don't need a map unless you're going to use one function string to look up the other.
http://www.cplusplus.com/reference/map/map/
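For instance, a minimal pre-C++11 sketch of the pair-in-a-vector idea (names are illustrative):

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> FunctionPair;   // (f(x), f'(x))

int main()
{
    std::vector<FunctionPair> dMap;
    dMap.push_back(std::make_pair("x", "1"));
    dMap.push_back(std::make_pair("log(x)", "1/x"));
    dMap.push_back(std::make_pair("e^x", "e^x"));

    std::size_t count = dMap.size();    // 3; no sizeof arithmetic required
    (void)count;
}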
Question 3) A little better description of what's not working would help. I notice lastString doesn't impact hx at all.
Question 4) "Second of all" Fastest is nothing to worry about at this point. Write what is easiest to look at until all the bugs are gone. "Premature optimization is the root of all evil", Donald Knuth.
Tip: Look into how the replace function might help you do the composition replacements:
http://www.cplusplus.com/reference/string/string/replace/
As the above commenter said, you shouldn't use C-style arrays even if it seems to make things 'easy'.
In reality doing things like that makes things harder.
C-style arrays aren't bounds-checked. That makes them a source of bugs due to memory unsafety, and can lead to all kinds of issues, from segfaults to corrupted data, as you read random data from unrelated blocks of memory or, even worse, write to them.
#include <iostream>
int main() {
int nums[] = {1, 2, 3};
std::cout << nums[3] << std::endl;
}
Compiling and running this reads past the end of the array and prints garbage:
# ./a.out
4196544
No programmer is perfect; every time you implement something like that, there is a chance you will be off by one in your bounds. Even if you are some programming god, most people have to work on a team with people who aren't. In many cases no one will even notice, since not every mistake causes anything obvious. Memory can be corrupted without anything crashing horribly, until you make a totally unrelated change that causes the memory to be laid out differently.
But when you do notice, it will often affect something totally unrelated that you coded later. Given that you will likely write many such arrays in your programming lifetime, you can make things much worse for yourself: you save ten minutes on each project but end up spending hours tracking down a bug in one.
If you really don't want C++11, then use std::vector<std::vector<std::string>>. It will use a little more memory, so you might lose some performance, but most of the time when people worry about performance they shouldn't. Are you calling this function 10,000 times a second? Even then, you could gain more performance by threading the code or preallocating memory. Most of the time people think something performs badly when in reality the compiler or the CPU optimizes it away. Is the cost of the memory allocation really going to be worse than computing the array size on every run?
This is also the case with raw pointers vs std::unique_ptr, std::shared_ptr.
If typing all those names looks like a pain, use a typedef to make it nice.
You can also look at using Boost's Array type, boost::array. Or whip up your own custom class.
That's not to say that you should never use that stuff. But you should only use it when you can justify it. The default should be the 'pure' C++ style code.
Performance (only when you have measured and see that you need it there).
C compatibility (but most of the time you can just wrap that stuff in the std classes anyway).
If you do feel you need it, then: make sure you unit-test your code, look at using the address and memory sanitizers that ship with current versions of gcc and clang, and quarantine the code as much as possible (i.e. in classes).
That all sounds like a lot of work, but once you have learned to do it, it becomes a habit; build it into your build system and it's just part of the development process, as easy as make test. And once you have it in one build system, you can reuse it in everything else you do. You will have expanded your programmer's toolkit. Those are all good habits to form in any case.
But here's the actual answer to your array size question:
std::string arr[][10] = {
{"xxx", "111"},
{"y", "222"},
{"hello", "goodbye"},
{"I like candy", "mmmm"},
{"Math goes here", "this is math"},
{"More random stuff", "adsfdsfasf"},
};
int size = sizeof(arr) / 10 / sizeof(std::string);
std::cout << size << std::endl;    // prints 6, as in 6 pairs of strings
Since the semantics are similar to a map (you are mapping a function to its derivative), I'd say the most suitable data structure is std::map, where you can easily get the derivative using the function as the index.
About the function: you never append lastString to hx. Flush it whenever you hit x:
hx += lastString + "(" + gx + ")";
and at the end return the leftovers too:
return hx + lastString;
Question 1 is actually quite straightforward:
std::string dMap[][10] = {{"x", "1"}, {"log(x)", "1/x"}, {"e^x", "e^x"}};
size_t tupleCount = sizeof(dMap)/sizeof(dMap[0]);
size_t maxTupleSize = sizeof(dMap[0])/sizeof(dMap[0][0]);
assert(tupleCount == 3);
assert(maxTupleSize == 10);
Note that you won't get the actual count of strings in a tuple this way. You only get the number of std::strings that can fit into each tuple. Of course, you can search each tuple for the first default-constructed std::string it contains. But the entire setup is an invitation for bugs, so you don't want to use it anyway (see below).
Question 2 can also be answered quite clearly. You should be using an std::unordered_map<>. Why?
Your use case is to map strings of one kind to another. That is exactly the semantics of std::map<> and std::unordered_map<>.
From your question I gather that you don't need a notion of a next or previous mapping; your mapping pairs are essentially unrelated. In this case, std::unordered_map<> is simply faster than std::map<> because it uses a hash table internally. No matter how big your std::unordered_map<> gets, looking up an element takes a constant amount of time on average. This is not true for std::map<>.
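A quick sketch of the derivative table as an unordered_map (this needs C++11; before that you'd reach for std::tr1::unordered_map or Boost):

#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, std::string> dMap;
    dMap["x"]      = "1";
    dMap["log(x)"] = "1/x";
    dMap["e^x"]    = "e^x";

    // Average-constant-time lookup; find() does not insert on a miss.
    std::unordered_map<std::string, std::string>::const_iterator it = dMap.find("log(x)");
    if (it != dMap.end())
        std::cout << it->second << '\n';    // prints 1/x
}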
There are two ways of map insertion:
m[key] = val;
Or
m.insert(make_pair(key, val));
My question is, which operation is faster?
People usually say the first one is slower, because if 'key' is not already in the map, operator[] first inserts a default-constructed element and then assigns 'val' to it.
But I don't see why the second way should be better because of 'make_pair'. make_pair is just a convenient way to build a 'pair' compared to pair<T1, T2>(key, val). Either way, there are two assignments: 'key' to 'pair.first' and 'val' to 'pair.second'. After the pair is made, the map inserts an element initialized from it.
So the first way is: 1. default-construct typeof(val), 2. assign.
The second way is: 1. assign, 2. copy-construct typeof(val).
Both accomplish different things.
m[key] = val;
Will insert a new key-value pair if the key doesn't exist already, or it will overwrite the old value mapped to the key if it already exists.
m.insert(make_pair(key, val));
Will only insert the pair if key doesn't exist yet, it will never overwrite the old value. So, choose accordingly to what you want to accomplish.
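A small demonstration of that difference:

#include <iostream>
#include <map>
#include <string>
#include <utility>

int main()
{
    std::map<std::string, int> m;
    m["a"] = 1;
    m["a"] = 2;                          // overwrites: m["a"] is now 2
    m.insert(std::make_pair("a", 3));    // key already present: no effect
    std::cout << m["a"] << '\n';         // prints 2
}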
For the question of which is more efficient: profile. :P Probably the first way, I'd say. The assignment (a copy) happens in both ways, so the only difference lies in construction. As we all know, and should implement, a default construction should basically be a no-op, and thus be very efficient. A copy is exactly that: a copy. So in way one we get a "no-op" and a copy, and in way two we get two copies.
Edit: In the end, trust what your profiling tells you. My analysis was off, as @Matthieu mentions in his comment, but that was my guess. :)
Then we have C++0x coming, and the double copy in the second way will be naught, as the pair can simply be moved. So in the end, I think it falls back to my first point: use the right way to accomplish the thing you want to do.
On a lightly loaded system with plenty of memory, this code:
#include <map>
#include <iostream>
#include <ctime>
#include <string>
using namespace std;
typedef map <unsigned int,string> MapType;
const unsigned int NINSERTS = 1000000;
int main() {
MapType m1;
string s = "foobar";
clock_t t = clock();
for ( unsigned int i = 0; i < NINSERTS; i++ ) {
m1[i] = s;
}
cout << clock() - t << endl;
MapType m2;
t = clock();
for ( unsigned int i = 0; i < NINSERTS; i++ ) {
m2.insert( make_pair( i, s ) );
}
cout << clock() - t << endl;
}
produces:
1547
1453
or similar values on repeated runs. So insert is (in this case) marginally faster.
Performance-wise I think they are mostly the same in general. There may be some exceptions for maps with large objects, in which case you should use [] or perhaps emplace, which creates fewer temporary objects than insert. See the discussion here for details.
You can, however, get a performance bump in special cases if you use the 'hint' overload of insert. So, looking at option 2 from here:
iterator insert (const_iterator position, const value_type& val);
the insert operation can be reduced to amortized constant time (from O(log n)) if you give a good hint (often the case if you know you are adding things at the back of your map).
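A sketch of the hint in action when keys arrive in ascending order (C++11 hint semantics, where the new element should go immediately before the hint):

#include <map>
#include <string>
#include <utility>

int main()
{
    std::map<int, std::string> m;
    for (int i = 0; i < 1000; ++i) {
        // end() is always just past the insertion point here,
        // so each insert is amortized O(1) instead of O(log n).
        m.insert(m.end(), std::make_pair(i, std::string("value")));
    }
}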
We have to refine the analysis by mentioning that the relative performance depends on the type (size) of the objects being copied as well.
I did an experiment similar to nbt's, with a map of (int -> set). I know it is a terrible thing to do, but it is illustrative for this scenario. The "value", in this case a set of ints, has 20 elements in it.
I ran a million iterations of the []= vs. insert operations and measured RDTSC cycles per iteration.
[] = set | 10731 cycles
insert(make_pair<>) | 26100 cycles
It shows the magnitude of the penalty added due to the copying. Of course, C++11 (move constructors) will change the picture.
My take on it:
Worth remembering that a map is a balanced binary tree; most of the modifications and checks take O(log N).
Depends really on the problem you are trying to solve.
1) If you just want to insert the value, knowing that it is not there yet, then [] does two things: a) it checks whether the item is there, and b) if it is not, it creates a pair and does what insert does (double work of O(log N)). So I would use insert.
2) If you are not sure whether it is there: a) if you checked a couple of lines above with something like if( map.find( item ) == map.end() ), then use insert, because of the double work [] would perform; b) if you didn't check, then it depends: insert won't modify the value if the key is already there, [] will; otherwise they are equivalent.
My answer is not on efficiency but on safety, which is relevant to choosing an insertion algorithm:
The [] and insert() calls can trigger destructors of the elements (of temporaries, for instance). This may have dangerous side effects if, say, your destructors have critical behavior inside.
After such a hazard, I stopped relying on the STL's implicit lazy-insertion features and always use explicit checks if my objects have behavior in their ctors/dtors.
See this question:
Destructor called on object when adding it to std::list