C++: First non-repeating character, O(n) time using hash map

I'm trying to write a function to get the first non-repeating character of a string. I haven't found a satisfactory answer on how to do this in O(n) time for all cases. My current solution is:
char getFirstNonRepeated(char * str) {
if (strlen(str) > 0) {
int visitedArray[256] = {}; // Where 256 is the size of the alphabet
for (int i = 0; i < strlen(str); i++) {
visitedArray[str[i]] += 1;
}
for (int j = 0; j < 256; j++) {
if (visitedArray[j] == 1) return j;
}
}
return '\0'; // Either strlen == 0 or all characters are repeated
}
However, as long as n < 256, this algorithm runs in O(n^2) time in the worst case. I've read that using a hash table instead of an array to store the number of times each character is visited could get the algorithm to run consistently in O(n) time, because insertions, deletions, and searches on hash tables run in O(1) time. I haven't found a question that explains how to do this properly. I don't have very much experience using hash maps in C++ so any help would be appreciated.

Why are you repeating those calls to strlen() in every loop? That is linear with the length of the string, so your first loop effectively becomes O(n^2) for no good reason at all. Just calculate the length once and store it, or use str[i] as the end condition.
You should also be aware that if your compiler uses signed characters, any character value above 127 will be considered negative (and used as a negative, i.e. out of bounds, array offset). You can avoid this by explicitly casting your character values to be unsigned char.
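A minimal sketch of how both points could be applied to the function from the question. The name firstNonRepeated is made up for this illustration. It also walks the string rather than the table in the second pass, so that it returns the first unique character in string order instead of the one with the smallest character code.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>

// Returns the first character that occurs exactly once in str,
// or '\0' if str is empty or every character repeats.
// firstNonRepeated is a hypothetical name, not taken from the question.
char firstNonRepeated(const char* str) {
    assert(str != nullptr);

    std::size_t counts[UCHAR_MAX + 1] = {};    // one counter per possible char value
    const std::size_t len = std::strlen(str);  // computed once, not on every iteration

    for (std::size_t i = 0; i < len; ++i) {
        // The cast keeps values above 127 from becoming negative indexes
        // on platforms where plain char is signed.
        ++counts[static_cast<unsigned char>(str[i])];
    }
    // Second pass over the string itself, so the answer respects string order.
    for (std::size_t i = 0; i < len; ++i) {
        if (counts[static_cast<unsigned char>(str[i])] == 1) {
            return str[i];
        }
    }
    return '\0';
}
```

Both loops are linear in the length of the string and the table has a fixed size, so the whole thing is O(n) without needing a hash map.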

Related

Program taking too much time

I'm trying to solve a coding problem, the problem is:
Take a string input
Take a number input 'n'
Repeat the string up to n indexes
Count the number of 'a' characters that occur in the repeated string
This problem was authored by tunyash on Hackerrank with title 'Repeated String'
My current solution is taking too much time to run
This is what I am currently doing:
Use a variable to iterate through the original string
Each time the variable exceeds the original string length, reset it to 0
Iterate n times
I've made a function to do the counting as follows:
long long repeatedString(std::string s, long long n) {
long long sIndex{ 0 }, length = s.size(), result{ 0 };
for (long long i = 0; i < n; i++)
{
if (sIndex > (length - 1))
sIndex = 0;
if (s[sIndex] == 'a')
result += 1;
sIndex += 1;
}
return result;
}
I've tried modifying this and using a binary search algorithm, by first writing out the whole string and then searching it, but the writing part takes too much time and does not seem very intuitive.

This is a typical beginner programming exercise. The idea is that you shouldn't blindly overengineer the problem, when a simple mathematical formula is right around the corner. In this case you can simply count the number of a's in the original string and multiply it by n to get the desired result:
std::count(s.begin(), s.end(), 'a') * n
where s is your input string.
Edit: I misinterpreted the question. I assumed n was the number of repetitions of the whole string, whereas it is actually the total number of characters, produced by repeating the string's characters cyclically until n characters have been written. In this case, divide before multiplying: multiply the count by n / s.length(), then add the number of a's among the first n % s.length() characters. I will leave this as an exercise.
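For readers who want to check their own attempt at the exercise, here is one possible sketch. It keeps the repeatedString signature from the question apart from taking the string by const reference, and it assumes s is non-empty and n is non-negative.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Counts the 'a' characters in the first n characters of s repeated forever.
// Assumes s is non-empty and n >= 0.
long long repeatedString(const std::string& s, long long n) {
    assert(!s.empty() && n >= 0);

    const long long length = static_cast<long long>(s.size());
    // Number of a's in one full copy of s.
    const long long perCopy = std::count(s.begin(), s.end(), 'a');
    const long long fullCopies = n / length;  // whole copies that fit into n characters
    const long long remainder = n % length;   // characters of the final, partial copy
    // Number of a's in the partial copy, i.e. in the first `remainder` characters of s.
    const long long inPrefix = std::count(s.begin(), s.begin() + remainder, 'a');

    return perCopy * fullCopies + inPrefix;
}
```

The work is proportional to the length of s and no longer depends on n, which is what makes the large test cases pass in time.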

Big-O Analysis Of Function, Replace All Vowels in a String With A Char

I am not sure if I am doing my big-O analysis correctly.
This is a function which replaces all the vowels in a string with a specified character.
I have chosen to compare every character in the string to a string of constant size which contains all of the vowels.
Given that the input string can scale upwards in size, but the vowel string is constant in size, I think the big-O analysis is O(n * m) rather than O(n * n), where n is the input string, and m is the vowel string.
I am thinking that it should just be O(n) and not even O(n * m), given that the second for loop iterates over a constant number of elements, so that would be dropped?
I'd greatly appreciate it if someone can correct me.
using namespace std;
string replaceVowels(string str, char ch) {
string vowels = "aeiouyAEIOUY";
for(int i = 0; i < str.size(); i++) { //O(n)
for(int j = 0; j < vowels.size(); j++) { //O(m)
if (str[i] == vowels[j])
str[i] = ch;
}
}
// O(n * m) or O(n) or O(n * n)?
return str;
}

In Big O notation you can ignore all constant (strictly positive) factors.
This means that (since m=12 is constant) O(n * m) = O(n), so both would technically be correct, but of course O(n) is what one would say (much like we were taught to answer "one half" instead of "two quarters" in elementary school).
That is it, at least in theory. In practice however, it is sometimes a subjective task to determine what counts as a constant, and what can be arbitrarily large. In this case for example, we use the fact that sizeof(int) is constant (otherwise i < str.size() wouldn't take constant time), but we still assume that n has no upper bound (otherwise O(n)=O(1)). Technically, these assumptions can't both be true, but we make them anyway in much the same way we assume 0 air friction in physics to fit a simpler model.
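To make the "m is a constant" point concrete, here is a sketch of a variant in which the vowel set is turned into a fixed-size lookup table once, so that each input character costs a single table lookup. The name replaceVowelsTable is invented for this example. It is an illustration of the argument, not a claim that the nested loop in the question needs replacing.

```cpp
#include <cassert>  // only needed for the check in the usage note below
#include <climits>
#include <string>

// Same behaviour as replaceVowels from the question, with the inner loop
// replaced by a table lookup. replaceVowelsTable is a hypothetical name.
std::string replaceVowelsTable(std::string str, char ch) {
    // Build the membership table once: work proportional to m (12 here).
    bool isVowel[UCHAR_MAX + 1] = {};
    const std::string vowels = "aeiouyAEIOUY";
    for (char v : vowels) {
        isVowel[static_cast<unsigned char>(v)] = true;
    }
    // One lookup per input character: n steps in total.
    for (char& c : str) {
        if (isVowel[static_cast<unsigned char>(c)]) {
            c = ch;
        }
    }
    return str;
}
```

For example, assert(replaceVowelsTable("Hello World", '*') == "H*ll* W*rld") holds. Both versions are O(n); the table only changes the constant factor.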

Why does the longest prefix which is also suffix calculation part in the KMP have a time complexity of O(n) and not O(n^2)?

I was going through the code of KMP when I noticed the Longest Prefix which is also suffix calculation part of KMP. Here is how it goes,
void computeLPSArray(char* pat, int M, int* lps)
{
int len = 0;
lps[0] = 0;
int i = 1;
while (i < M) {
if (pat[i] == pat[len]) {
len++;
lps[i] = len;
i++;
}
else
{
if (len != 0) {
len = lps[len - 1]; //<----I am referring to this part
}
else
{
lps[i] = 0;
i++;
}
}
}
}
Now the part where I got confused was the one which I have shown in comments in the above code. Now we do know that when a code contains a loop like the following
int a[m];
memset(a, 0, sizeof(a));
for(int i = 0; i<m; i++){
for(int j = i; j>=0; j--){
a[j] = a[j]*2;//This inner loop is causing the same cells in the 1
//dimensional array to be visited more than once.
}
}
The complexity comes out to be O(m*m).
Similarly if we write the above LPS computation in the following format
while(i<M){
if{....}
else{
if(len != 0){
//doesn't this part cause the code to again go back a few elements
//in the LPS array the same way as the inner loop in my above
//written nested for loop does? Shouldn't that mean the same cell
//in the array is getting visited more than once and hence the
//complexity should increase to O(M^2)?
}
}
}
It might be that the way I think complexities are calculated is wrong. So please clarify.

If expressions do not take time that grows with len.
Len is an integer. Reading it takes O(1) time.
Array indexing is O(1).
Visiting something more than once does not by itself put you in a higher big-O class. That only happens if the visit count grows faster than k*n for every constant k.

If you carefully analyze the algorithm that builds the prefix table, you will notice that the total number of rolled-back positions can be at most m, so the upper bound on the total number of iterations is 2*m, which yields O(m).
The value of len grows alongside the main iterator i. Whenever there is a mismatch, len drops back toward zero, but this drop cannot exceed the distance covered by the main iterator i since the start of the match.
For example, let's say, the main iterator i started matching with len at position 5 and mismatched at position 20.
So,
LPS[5]=1
LPS[6]=2
...
LPS[19]=15
At the moment of mismatch, len has a value of 15. Hence it may rollback at most 15 positions down to zero, which is equivalent to the interval passed by i while matching. In other words, on every mismatch, len travels back no more than i has traveled forward since the start of match
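One way to convince yourself of the 2*m bound is to count the loop iterations directly. The sketch below follows the same logic as computeLPSArray from the question, rewritten with std::string and std::vector. The name computeLps and the iterations counter are additions for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds the LPS table for pat and reports how many times the loop body ran.
// computeLps and the iterations parameter are hypothetical, added for counting.
std::vector<int> computeLps(const std::string& pat, std::size_t& iterations) {
    const std::size_t m = pat.size();
    std::vector<int> lps(m, 0);
    iterations = 0;

    std::size_t len = 0;
    std::size_t i = 1;
    while (i < m) {
        ++iterations;
        if (pat[i] == pat[len]) {
            ++len;                          // match: len grows by one...
            lps[i] = static_cast<int>(len);
            ++i;                            // ...and so does i
        } else if (len != 0) {
            // Fallback: len strictly shrinks while i stays put.
            len = static_cast<std::size_t>(lps[len - 1]);
        } else {
            lps[i] = 0;                     // mismatch with len == 0: only i moves
            ++i;
        }
    }
    // i advances fewer than m times, and len cannot shrink more often than it grew.
    assert(iterations <= 2 * m);
    return lps;
}
```

For the pattern "aaaab" the loop body runs 7 times: three matches, three fallbacks at the final 'b', and one last step that writes lps[4] = 0. That is below 2*m = 10 even though the same cells of lps are read more than once.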

Finding repeating signed integers with O(n) in time and O(1) in space

(This is a generalization of: Finding duplicates in O(n) time and O(1) space)
Problem: Write a C++ or C function with time and space complexities of O(n) and O(1) respectively that finds the repeating integers in a given array without altering it.
Example: Given {1, 0, -2, 4, 4, 1, 3, 1, -2} function must print 1, -2, and 4 once (in any order).
EDIT: The following solution requires a duo-bit (to represent 0, 1, and 2) for each integer in the range of the minimum to the maximum of the array. The number of necessary bytes (regardless of array size) never exceeds (INT_MAX – INT_MIN)/4 + 1.
#include <stdio.h>
#include <stdlib.h> /* calloc, free */
void set_min_max(int a[], long long unsigned size,\
int* min_addr, int* max_addr)
{
long long unsigned i;
if(!size) return;
*min_addr = *max_addr = a[0];
for(i = 1; i < size; ++i)
{
if(a[i] < *min_addr) *min_addr = a[i];
if(a[i] > *max_addr) *max_addr = a[i];
}
}
void print_repeats(int a[], long long unsigned size)
{
long long unsigned i;
int min = 0, max = 0; /* initialized so that size == 0 does not read garbage */
long long diff, q, r;
unsigned char* duos; /* unsigned, so shifting a byte right never involves a negative value */
set_min_max(a, size, &min, &max);
diff = (long long)max - (long long)min;
duos = calloc(diff / 4 + 1, 1);
for(i = 0; i < size; ++i)
{
diff = (long long)a[i] - (long long)min; /* index of duo-bit
corresponding to a[i]
in sequence of duo-bits */
q = diff / 4; /* index of byte containing duo-bit in "duos" */
r = diff % 4; /* offset of duo-bit */
switch( (duos[q] >> (6 - 2*r )) & 3 )
{
case 0: duos[q] += (1 << (6 - 2*r));
break;
case 1: duos[q] += (1 << (6 - 2*r));
printf("%d ", a[i]);
}
}
putchar('\n');
free(duos);
}
int main(void)
{
int a[] = {1, 0, -2, 4, 4, 1, 3, 1, -2};
print_repeats(a, sizeof(a)/sizeof(int));
}

The definition of big-O notation is that its argument is a function (f(x)) that, as the variable in the function (x) tends to infinity, there exists a constant K such that the objective cost function will be smaller than Kf(x). Typically f is chosen to be the smallest such simple function such that the condition is satisfied. (It's pretty obvious how to lift the above to multiple variables.)
This matters because that K, which you aren't required to specify, allows a whole multitude of complex behavior to be hidden out of sight. For example, if the core of the algorithm is O(n^2), it allows all sorts of other O(1), O(log n), O(n), O(n log n), O(n^(3/2)), etc. supporting bits to be hidden, even if for realistic input data those parts are what actually dominate. That's right, it can be completely misleading! (Some of the fancier bignum algorithms have this property for real. Lying with mathematics is a wonderful thing.)
So where is this going? Well, you can assume that int is a fixed size easily enough (e.g., 32-bit) and use that information to skip a lot of trouble and allocate fixed size arrays of flag bits to hold all the information that you really need. Indeed, by using two bits per potential value (one bit to say whether you've seen the value at all, another to say whether you've printed it) then you can handle the code with fixed chunk of memory of 1GB in size. That will then give you enough flag information to cope with as many 32-bit integers as you might ever wish to handle. (Heck that's even practical on 64-bit machines.) Yes, it's going to take some time to set that memory block up, but it's constant so it's formally O(1) and so drops out of the analysis. Given that, you then have constant (but whopping) memory consumption and linear time (you've got to look at each value to see whether it's new, seen once, etc.) which is exactly what was asked for.
It's a dirty trick though. You could also try scanning the input list to work out the range allowing less memory to be used in the normal case; again, that adds only linear time and you can strictly bound the memory required as above so that's constant. Yet more trickiness, but formally legal.
[EDIT] Sample C code (this is not C++, but I'm not good at C++; the main difference would be in how the flag arrays are allocated and managed):
#include <stdio.h>
#include <stdlib.h>
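// Bit fiddling magic: test or set the flag bit that belongs to a 32-bit value.
// 1u keeps the shift unsigned, so shifting into bit 31 is well defined.
int is(int *ary, unsigned int value) {
    return (ary[value>>5] & (1u<<(value&31))) != 0;
}
void set(int *ary, unsigned int value) {
    ary[value>>5] |= 1u<<(value&31);
}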
// Main loop
void print_repeats(int a[], unsigned size) {
int *seen, *done;
unsigned i;
seen = calloc(134217728, sizeof(int));
done = calloc(134217728, sizeof(int));
for (i=0; i<size; i++) {
if (is(done, (unsigned) a[i]))
continue;
if (is(seen, (unsigned) a[i])) {
set(done, (unsigned) a[i]);
printf("%d ", a[i]);
} else
set(seen, (unsigned) a[i]);
}
printf("\n");
free(done);
free(seen);
}
int main(void) {
int a[] = {1,0,-2,4,4,1,3,1,-2};
print_repeats(a,sizeof(a)/sizeof(int));
}

Since you have an array of integers you can use the straightforward solution with sorting the array (you didn't say it can't be modified) and printing duplicates. Integer arrays can be sorted with O(n) and O(1) time and space complexities using Radix sort. Although, in general it might require O(n) space, the in-place binary MSD radix sort can be trivially implemented using O(1) space (look here for more details).

The O(1) space constraint is intractable.
The very fact of printing the array itself requires O(N) storage, by definition.
Now, feeling generous, I'll give you that you can have O(1) storage for a buffer within your program and consider that the space taken outside the program is of no concern to you, and thus that the output is not an issue...
Still, the O(1) space constraint feels intractable, because of the immutability constraint on the input array. It might not be, but it feels so.
And your solution overflows, because you try to memorize an O(N) information in a finite datatype.

There is a tricky problem with definitions here. What does O(n) mean?
Konstantin's answer claims that the radix sort time complexity is O(n). In fact it is O(n log M), where the base of the logarithm is the radix chosen, and M is the range of values that the array elements can have. So, for instance, a binary radix sort of 32-bit integers will have log M = 32.
So this is still, in a sense, O(n), because log M is a constant independent of n. But if we allow this, then there is a much simpler solution: for each integer in the range (all 4294967296 of them), go through the array to see if it occurs more than once. This is also, in a sense, O(n), because 4294967296 is also a constant independent of n.
I don't think my simple solution would count as an answer. But if not, then we shouldn't allow the radix sort, either.

I doubt this is possible. Assuming there is a solution, let's see how it works. I'll try to be as general as I can and show that it can't work... So, how does it work?
Without losing generality we could say we process the array k times, where k is fixed. The solution should also work when there are m duplicates, with m >> k. Thus, in at least one of the passes, we should be able to output x duplicates, where x grows when m grows. To do so, some useful information has been computed in a previous pass and stored in the O(1) storage. (The array itself can't be used, this would give O(n) storage.)
The problem: we have O(1) of information, and when we walk over the array we have to identify x numbers (to output them). We need an O(1) store that can tell us in O(1) time whether an element is in it. Or, said in a different way, we need a data structure that stores n booleans (of which x are true), uses O(1) space, and takes O(1) time to query.
Does this data structure exist? If not, then we can't find all duplicates in an array with O(n) time and O(1) space (or there is some fancy algorithm that works in a completely different manner???).

I really don't see how you can have only O(1) space and not modify the initial array. My guess is that you need an additional data structure. For example, what is the range of the integers? If it's 0..N like in the other question you linked, you can have an additional count array of size N. Then in O(N) traverse the original array and increment the counter at the position of the current element. Then traverse the other array and print the numbers with count >= 2. Something like:
int* counts = new int[N](); // the () value-initializes the array, so every count starts at 0
for(int i = 0; i < N; i++) {
counts[input[i]]++;
}
for(int i = 0; i < N; i++) {
if(counts[i] >= 2) cout << i << " ";
}
delete [] counts;

Say you can use the fact you are not using all the space you have. You only need one more bit per possible value and you have lots of unused bit in your 32-bit int values.
This has serious limitations, but works in this case. Numbers have to be between -n/2 and n/2 and if they repeat m times, they will be printed m/2 times.
void print_repeats(long a[], unsigned size) {
long i, val, pos, topbit = 1 << 31, mask = ~topbit;
for (i = 0; i < size; i++)
a[i] &= mask;
for (i = 0; i < size; i++) {
val = a[i] & mask;
if (val <= mask/2) {
pos = val;
} else {
val += topbit;
pos = size + val;
}
if (a[pos] < 0) {
printf("%ld\n", val);
a[pos] &= mask;
} else {
a[pos] |= topbit;
}
}
}
int main(void) {
long a[] = {1, 0, -2, 4, 4, 1, 3, 1, -2};
print_repeats(a, sizeof (a) / sizeof (long));
}
prints
4
1
-2

Given an array of integers, find the first integer that is unique

Given an array of integers, find the first integer that is unique.
My solution: use std::map.
Put the integers into it one by one (number as key, its index as value) (O(n^2 lg n)); if there is a duplicate, remove the entry from the map (O(lg n)). After putting all the numbers into the map, iterate over the map and find the key with the smallest index (O(n)).
It is O(n^2 lg n) because the map needs to do sorting.
It is not efficient.
Are there other, better solutions?

I believe that the following would be the optimal solution, at least based on time / space complexity:
Step 1:
Store the integers in a hash map, which holds the integer as a key and the count of the number of times it appears as the value. This is generally an O(n) operation and the insertion / updating of elements in the hash table should be constant time, on the average. If an integer is found to appear more than twice, you really don't have to increment the usage count further (if you don't want to).
Step 2:
Perform a second pass over the integers. Look each up in the hash map and the first one with an appearance count of one is the one you were looking for (i.e., the first single appearing integer). This is also O(n), making the entire process O(n).
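The two steps could look roughly like this in C++ with std::unordered_map. The function name firstUnique and the bool/out-parameter interface are choices made for this sketch, not something from the question.

```cpp
#include <cassert>  // only needed for the check in the usage note below
#include <unordered_map>
#include <vector>

// Writes the first value that occurs exactly once into out and returns true,
// or returns false if the input is empty or every value repeats.
// firstUnique and its interface are hypothetical.
bool firstUnique(const std::vector<int>& values, int& out) {
    std::unordered_map<int, int> counts;
    counts.reserve(values.size());

    // Step 1: count occurrences; each update is O(1) on average.
    for (int v : values) {
        ++counts[v];
    }
    // Step 2: walk the input in its original order and stop at the first count of 1.
    for (int v : values) {
        if (counts[v] == 1) {
            out = v;
            return true;
        }
    }
    return false;
}
```

With int out = 0, the call firstUnique({5, 5, 66, 66, 7, 1, 1, 77}, out) returns true and leaves 7 in out, so assert(out == 7) passes. The O(n) figure is an average-case bound, since a hash table can degrade with many collisions.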
Some possible optimizations for special cases:
Optimization A: It may be possible to use a simple array instead of a hash table. This guarantees O(1) even in the worst case for counting the number of occurrences of a particular integer as well as the lookup of its appearance count. Also, this enhances real time performance, since the hash algorithm does not need to be executed. There may be a hit due to potentially poorer locality of reference (i.e., a larger sparse table vs. the hash table implementation with a reasonable load factor). However, this would be for very special cases of integer orderings and may be mitigated by the hash table's hash function producing pseudorandom bucket placements based on the incoming integers (i.e., poor locality of reference to begin with).
Each byte in the array would represent the count (up to 255) for the integer represented by the index of that byte. This would only be possible if the difference between the lowest integer and the highest (i.e., the cardinality of the domain of valid integers) was small enough such that this array would fit into memory. The index in the array of a particular integer would be its value minus the smallest integer present in the data set.
For example on modern hardware with a 64-bit OS, it is quite conceivable that a 4GB array can be allocated which can handle the entire domain of 32-bit integers. Even larger arrays are conceivable with sufficient memory.
The smallest and largest integers would have to be known before processing, or another linear pass through the data using the minmax algorithm to find out this information would be required.
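A sketch of Optimization A under the assumption that the range between the smallest and largest value is small enough for the count array to fit in memory. The name firstUniqueDense is invented here, and the counts saturate at 2 because the second pass only needs to distinguish "once" from "more than once".

```cpp
#include <algorithm>
#include <cassert>  // only needed for the check in the usage note below
#include <cstddef>
#include <vector>

// Array-based variant: one byte per integer between the minimum and maximum.
// Assumes (max - min + 1) bytes fit in memory. firstUniqueDense is hypothetical.
bool firstUniqueDense(const std::vector<int>& values, int& out) {
    if (values.empty()) {
        return false;
    }
    // Extra linear pass to find the smallest and largest value.
    const auto mm = std::minmax_element(values.begin(), values.end());
    const long long lo = *mm.first;
    const long long range = static_cast<long long>(*mm.second) - lo + 1;

    std::vector<unsigned char> counts(static_cast<std::size_t>(range), 0);
    for (int v : values) {
        // Index of a value is the value minus the smallest value present.
        unsigned char& c = counts[static_cast<std::size_t>(v - lo)];
        if (c < 2) {
            ++c;  // saturate at 2 so the byte can never overflow
        }
    }
    for (int v : values) {
        if (counts[static_cast<std::size_t>(v - lo)] == 1) {
            out = v;
            return true;
        }
    }
    return false;
}
```

With int out = 0, firstUniqueDense({3, -2, 3, 7, -2}, out) returns true and assert(out == 7) passes; the count array in that call has only 10 entries.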
Optimization B: You could optimize Optimization A further, by using at most 2 bits per integer (One bit indicates presence and the other indicates multiplicity). This would allow for the representation of four integers per byte, extending the array implementation to handle a larger domain of integers for a given amount of available memory. More bit games could be played here to compress the representation further, but they would only support special cases of data coming in and therefore cannot be recommended for the still mostly general case.

All this for no reason. Just using 2 for-loops & a variable would give you a simple O(n^2) algo.
If you are taking all the trouble of using a hash map, then it might as well be what @Micheal Goldshteyn suggests.
UPDATE: I know this question is 1 year old. But was looking through the questions I answered and came across this. Thought there is a better solution than using a hashtable.
When we say unique, there is a pattern, at least when equal values sit next to each other as in this example: [5, 5, 66, 66, 7, 1, 1, 77]. Slide a window of 3 elements over the array. First consider (5, 5, 66): we can easily establish that there is a duplicate here, so move the window by one element to get (5, 66, 66). Same here, so move on to (66, 66, 7). Again there are duplicates. Next is (66, 7, 1). No duplicates here! Take the middle element, since it has to be the first unique one in the set: the left element belongs to a duplicate pair, and so could the right one (1). Hence 7 is the first unique element.
space: O(1)
time: O(n) * O(m^2) = O(n) * 9 ≈ O(n)

Inserting into a map is O(log n), not O(n log n), so inserting n keys is O(n log n). Also, it's better to use a set.

Although it's O(n^2), the following has small coefficients, isn't too bad on the cache, and uses memmem() which is fast.
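// memmem() is a GNU extension; the needle must be passed as a pointer (&array[x]).
for (int x = 0; x < len; x++)
    if (memmem(&array[x+1], sizeof(int)*(len-(x+1)), &array[x], sizeof(int)) == NULL && // no later copy
        memmem(array, sizeof(int)*x, &array[x], sizeof(int)) == NULL)                   // no earlier copy
        return array[x];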

public static string firstUnique(int[] input)
{
    int size = input.Length;
    bool[] dupIndex = new bool[size];
    for (int i = 0; i < size; ++i)
    {
        if (dupIndex[i])
        {
            continue;
        }
        bool unique = true;
        for (int j = i + 1; j < size; ++j)
        {
            if (input[i] == input[j])
            {
                // Mark every later copy, not only the first one, so that
                // none of them is mistaken for a unique element later on.
                dupIndex[j] = true;
                unique = false;
            }
        }
        if (unique)
        {
            return input[i].ToString();
        }
    }
    return "No unique element";
}

@user3612419
The solution you gave is good, at roughly O(N^2), but further optimization of the same code is possible. I just added the two or three lines that you missed.
public static string firstUnique(int[] input)
{
    int size = input.Length;
    bool[] dupIndex = new bool[size];
    for (int i = 0; i < size; ++i)
    {
        if (dupIndex[i])
        {
            continue;
        }
        bool unique = true;
        for (int j = i + 1; j < size; ++j)
        {
            // Added: skip positions already known to be duplicates.
            if (dupIndex[j])
            {
                continue;
            }
            if (input[i] == input[j])
            {
                dupIndex[j] = true;
                dupIndex[i] = true; // Added: mark the current element as well.
                unique = false;
            }
        }
        if (unique)
        {
            return input[i].ToString();
        }
    }
    return "No unique element";
}
