Polynomial hash code results in negative numbers? - c++

For large j, in certain cases the hash function below returns negative values.
int hashing::hash(string a)
{
    int i = 0;
    int hvalue = 0;
    int h = 0;
    while (a[i] != NULL)
    {
        hvalue = hvalue + (int(a[i])) * pow(31, i);
        i++;
    }
    h = hvalue % j;
    return h;
}
How is that possible, and how can I correct it?
In the above code, j is a prime number calculated from the size of the file. The negative values arise in certain specific cases, for example where the string has the form " the s".

Remember that int has a finite range and is (usually) a signed value. That means that if you exceed the maximum possible value for an int, it will wrap around and might become negative.
There are a couple of ways you could fix that. First, you could switch to using unsigned ints to hold the hash code, which are never negative and will behave nicely when wrapping around. Alternatively, if you still want to use ints, you can mask off the sign bit (the bit at the front of the number that makes the value negative) by doing this:
return (hvalue & INT_MAX) % j;
(Here, INT_MAX is defined in <climits>.) This will ensure your value is positive, though you lose one bit of your hash code, which for large data sets might lead to a bit more clustering. The reason for doing the & before the mod is that you want the value to be positive before taking the mod; otherwise you can end up with a negative bucket index.
EDIT: You also have a serious error in your logic. This loop is incorrect:
while (a[i] != NULL) {
    ...
}
A std::string isn't a null-terminated C array, so this loop isn't guaranteed to stop before it reads past the end of the string. Try changing it to read
for (int i = 0; i < a.length(); i++) {
    /* ... process a[i] ... */
}
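Putting both fixes together, here is a minimal sketch of how the whole function could look, assuming the same hashing class with its j member from your code. Note it accumulates with Horner's rule, so the powers of 31 run in the opposite order from your pow version, which makes no difference for a hash:
#include <climits>
#include <string>

int hashing::hash(std::string a)
{
    unsigned int hvalue = 0;               // unsigned: wrap-around is well defined
    for (std::string::size_type i = 0; i < a.length(); ++i)
        hvalue = hvalue * 31 + a[i];       // Horner's rule, no pow() needed
    return (hvalue & INT_MAX) % j;         // mask first, then reduce into [0, j)
}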
Hope this helps!

Related

How to insert an element at the beginning when using recursion?

I need to fill an array with the digits of a natural number using recursion. The problem is that I don't understand recursion very well.
int fill(long long number, int arr[10])
{
    if (number < 10)
    {
        arr[0] = number;
        return arr[10];
    }
    else
    {
        arr[0] = number % 10;
        for (int i = 0; i > 10; i++)
        {
            arr[i+1] = arr[i];
        }
        return fill(number/10, arr);
    }
}
If anyone can help in any way it would be much appreciated.
If you have a problem that must be solved with recursion, you probably should not be using for-loops.
The goal is that each iteration of fill() fills in one digit in the right position in the array, and if necessary calls itself again to fill in the remaining digits. You already have the right kind of structure in your code, but it's inefficient because of the extra for-loop. You can avoid it by using the return value of fill() to keep track of where you have to place digits. Here is a possible solution:
int fill(long long number, int arr[10])
{
    if (!number)
        return 0;
    int pos = fill(number / 10, arr);
    arr[pos] = number % 10;
    return pos + 1;
}
In this implementation, we call ourselves recursively until the number is zero. When it is zero, we return 0. The return value is used to indicate where in the array we have to write a digit. So after we reach the deepest recursion level, and return for the first time, we write the most significant digit to arr[0]. Then we return 0 + 1. That means that one recursion level up, we have pos = 1, and we write the second most significant digit to arr[1], and then we return 1 + 1, and so on until we write the least significant digit, and then we are done. The return value of the initial call to fill() is then equal to the number of digits written to arr.
There are two more issues with this function. The first is when number is larger than 10 digits. In that case, it will write past the end of the array. So you will need to add some check to prevent that from happening, or ensure the array is large enough to hold the largest possible long long value (which is 19 digits if long long is 64 bits). Check LLONG_MAX from the <climits> header to get the maximum value for your platform. The second is that this function doesn't handle negative numbers very well. If you want to ensure it only handles non-negative numbers, change it to use unsigned long long. In that case, be aware that the largest number is ULLONG_MAX, and on 64-bit platforms this means 20 digits.
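A hedged variant addressing both points might look like this; the capacity parameter and the switch to unsigned long long are additions for illustration, not part of the original exercise:
// Same recursion, but with an explicit capacity so it can refuse to overflow the array.
// Returns the number of digits written, or -1 if the array is too small.
int fill(unsigned long long number, int arr[], int capacity)
{
    if (number == 0)
        return 0;
    int pos = fill(number / 10, arr, capacity);
    if (pos < 0 || pos >= capacity)
        return -1;                    // would write past the end of arr
    arr[pos] = number % 10;
    return pos + 1;
}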
Also, the arr[i+1] access in your for loop will be out of bounds (undefined behaviour) when i = 9 or i = 10.
I am not going to solve it for you; here is how you should think if you want to master recursion (which is indeed often hard for people encountering it for the first time).
Your function is supposed to fill the first elements of the array with digits and return the number of digits.
Suppose that number >= 10. You called fill(number / 10, arr) and it returned x, the number of digits in number / 10. What should you do now? What should you return?
Suppose that number < 10. What should you do? What should you return?

Sieve of Eratosthenes calculator -- running into memory issues and crashing with numbers >= 1,000,000

I'm not exactly sure why this is. I tried changing the variables to long long, and I even tried doing a few other things -- but it's either about the inefficiency of my code (it literally does the whole process of finding all primes up to the number, then checking whether the number is divisible by each prime -- very inefficient, but it's my first attempt at this and I feel pretty accomplished having it work at all...)
Or the fact that it overflows the stack. I'm not sure where it is exactly, but all I know is that it MUST be related to memory and the way it's dealing with the number.
If I had to guess, I'd say it's a memory issue happening during the prime number generation up to that number -- that's where it dies even if I remove the check against the input number.
I'll post my code -- just be aware, I didn't change long long back to int in a few places, and I also have a SquareRoot variable that is not used; it was supposed to help memory efficiency but wasn't effective the way I tried to do it, and I just never deleted it. I will clean up the code when and if I can successfully finish it.
As far as I am aware though, it DOES work pretty reliably for 999,999 and below; I checked it against other calculators of the same type and it does seem to generate the proper answers.
If anyone can help or explain what I screwed up here, you're helping a guy trying to learn on his own without any school or anything, so it's appreciated.
#include <iostream>
#include <cmath>

void sieve(int ubound, int primes[]);

int main()
{
    long long n;
    int i;
    std::cout << "Input Number: ";
    std::cin >> n;
    if (n < 2) {
        return 1;
    }
    long long upperbound = n;
    int A[upperbound];
    int SquareRoot = sqrt(upperbound);
    sieve(upperbound, A);
    for (i = 0; i < upperbound; i++) {
        if (A[i] == 1 && upperbound % i == 0) {
            std::cout << " " << i << " ";
        }
    }
    return 0;
}

void sieve(int ubound, int primes[])
{
    long long i, j, m;
    for (i = 0; i < ubound; i++) {
        primes[i] = 1;
    }
    primes[0] = 0, primes[1] = 0;
    for (i = 2; i < ubound; i++) {
        for (j = i * i; j < ubound; j += i) {
            primes[j] = 0;
        }
    }
}
If you use legal C++ constructs instead of non-standard variable length arrays, your code will run (whether it produces the correct answers is another question).
The issue is more than likely that you're exceeding the limits of the stack when you declare arrays with a million or more elements.
Therefore instead of this:
long long upperbound = n;
int A[upperbound];
Use std::vector:
#include <vector>
//...
long long upperbound = n;
std::vector<int> A(upperbound);
and then:
sieve(upperbound, A.data());
The std::vector does not use the stack space to allocate its elements (unless you have written an allocator for it that uses the stack).
As a matter of fact, you don't even need to pass upperbound to sieve, as a std::vector knows its own size by calling the size() member function. But I leave that as an exercise.
Live example using 2,000,000
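For what it's worth, a minimal sketch of that exercise, assuming the std::vector setup above, could look like this:
#include <cstddef>
#include <vector>

// The vector carries its own size, so no separate bound parameter is needed.
void sieve(std::vector<int>& primes)
{
    for (std::size_t i = 0; i < primes.size(); i++)
        primes[i] = 1;
    if (primes.size() > 1)
        primes[0] = primes[1] = 0;
    for (std::size_t i = 2; i * i < primes.size(); i++)
        for (std::size_t j = i * i; j < primes.size(); j += i)
            primes[j] = 0;
}
The call then becomes sieve(A); instead of sieve(upperbound, A.data());.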
First of all, read and apply PaulMcKenzie's advice. That's the most important thing. I'm only addressing some teeny bits of your question that remained open.
It seems that you are trying to factor the number that you misleadingly called upperbound. The mysterious role of the square root of this number is related to this fact: if the number is composite at all - and hence can be computed as the product of some prime factors - then the smallest of these prime factors cannot be greater than the square root of the number. In fact, only one factor can possibly be greater, all others cannot exceed the square root.
However, in its present form your code cannot draw advantage from this fact. The trial division loop as it stands now has to run up to number_to_be_factored / 2 in order not to miss any factors because its body looks like this:
if (sieve[i] == 1 && number_to_be_factored % i == 0) {
    std::cout << " " << i << " ";
}
You can factor much more efficiently if you refactor your code a bit: when you have found the smallest prime factor p of your number then the remaining factors to be found must be precisely those of rest = number_to_be_factored / p (or n = n / p, if you will), and none of the remaining factors can be smaller than p. However, don't forget that p might occur more than once as a factor.
During any round of the proceedings you only need to consider the prime factors between p and the square root of the current number; if none of those primes divides the current number then it must be prime. To test whether p exceeds the square root of some number n you can use if (p * p > n), which is computationally more efficient than actually computing the square root.
Hence the square root occurs in two different roles:
the square root of the number to be factored limits the amount of sieving that needs to be done
during the trial division loop, the square root of the current number gives an upper bound for the highest prime factor that you need to consider
That's two faces of the same coin but two different usages in the actual code.
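As a rough sketch of that refactored approach (it trial-divides by every candidate rather than only by sieved primes, but it shows the divide-out step and the square-root bound):
#include <iostream>

// Print the prime factorization of n by repeated trial division.
void factor(long long n)
{
    for (long long p = 2; p * p <= n; ++p) {
        while (n % p == 0) {          // p may occur more than once as a factor
            std::cout << p << " ";
            n /= p;
        }
    }
    if (n > 1)                        // whatever is left exceeds the square root and is itself prime
        std::cout << n;
    std::cout << "\n";
}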
Note: once you have got your code working by applying PaulMcKenzie's advice, you might also consider posting it over on Code Review.

C++ Adding big numbers together with operator overload

I am new to C++ and attempting to create a "BigInt" class. I decided to base most of the implementation on reading the numbers into vectors.
So far I have only written the constructor that takes an input string.
Largenum::Largenum(std::string input)
{
    for (std::string::const_iterator it = input.begin(); it != input.end(); ++it)
    {
        number.push_back(*it - '0');
    }
}
The problem I am having is with the addition function. I have created a function which seems to work after I tested it a few times, but as you can see it's highly inefficient. I have two different vectors such as:
std::vector<int> x = {1,3,4,5,9,1};
std::vector<int> y = {2,4,5,6};
The way I thought to solve this problem was to add 0s before the shorter, in this case y vector to make both vectors have the same size such as:
x = {1,3,4,5,9,1};
y = {0,0,2,4,5,6};
Then to add them using elementary style addition.
I don't want to add 0s in front of vector y, as it would be slow with a large number. My current solution is to reverse the vector, then push_back the appropriate amount of 0s, then reverse it back. This may be slower than simply inserting at the front; I have not tested that yet.
The problem is that after I do all of the addition on the vectors and push_back the result, I am left with a backward vector and I need to use reverse yet again! There has got to be a much better way than my method, but I am stuck on finding it. Ideally I would make A const as well. Here is the code of the function:
Largenum Largenum::operator+(Largenum &A)
{
    bool carry = 0;
    Largenum sum;
    std::vector<int>::size_type max = std::max(A.number.size(), this->number.size());
    std::vector<int>::size_type diff = std::abs(A.number.size() - this->number.size());
    if (A.number.size() > this->number.size())
    {
        std::reverse(this->number.begin(), this->number.end());
        for (std::vector<int>::size_type i = 0; i < (max - diff); ++i) this->number.push_back(0);
        std::reverse(this->number.begin(), this->number.end());
    }
    else if (this->number.size() > A.number.size())
    {
        std::reverse(A.number.begin(), A.number.end());
        for (std::vector<int>::size_type i = 0; i < (max - diff); ++i) A.number.push_back(0);
        std::reverse(A.number.begin(), A.number.end());
    }
    for (std::vector<int>::size_type i = max; i != 0; --i)
    {
        int num = (A.number[i-1] + this->number[i-1] + carry) % 10;
        sum.number.push_back(num);
        (A.number[i-1] + this->number[i-1] + carry >= 10) ? carry = 1 : carry = 0;
    }
    if (carry) sum.number.push_back(1);
    reverse(sum.number.begin(), sum.number.end());
    return sum;
}
If anyone has any input that would be great; this is my first program using classes in C++ and it's fairly overwhelming.
I think your function is quite close to the most optimal one I have seen. Still, here are a few suggestions on how to improve it:
The decimal numeral system is quite inefficient: you have a lot of digits for big numbers. Better to use a higher base to reduce the number of digits you have to add. Reading and writing such numbers in human-readable representation will be a bit harder, but you will speed up the operations several times, because you will have fewer digits.
When implementing big integers I represent them in reverse order, so the least significant digit is at index 0 and the most significant one is at the end of the array. This way, when a carry forces you to add a new digit, you only perform a push_back, not a whole reverse.
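As a hedged illustration of that reversed layout (a standalone sketch, not a drop-in replacement for your operator+), addition then needs no padding and no reverse at all:
#include <cstddef>
#include <vector>

// Digits stored least significant first: 1234 is {4, 3, 2, 1}.
std::vector<int> add(const std::vector<int>& a, const std::vector<int>& b)
{
    std::vector<int> sum;
    int carry = 0;
    for (std::size_t i = 0; i < a.size() || i < b.size() || carry; ++i) {
        int digit = carry;
        if (i < a.size()) digit += a[i];
        if (i < b.size()) digit += b[i];
        carry = digit / 10;
        sum.push_back(digit % 10);    // a final carry simply becomes one more digit
    }
    return sum;
}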
One issue: integer modulus is pretty slow on modern processors, even compared to branch misprediction. Rather than doing an explicit %10, try this for your third for-loop:
int num = A.number[i-1] + this->number[i-1] + carry;
if (num >= 10)
{
    carry = 1;
    num -= 10;
}
else
{
    carry = 0;
}
sum.number.push_back(num);

Long array performance issue

I have an array of char pointers of length 175,000. Each pointer points to a C-string of length 100, where each character is either '1' or '0'. I need to compute the difference between the strings.
char* arr[175000];
So far, I have two for loops where I compare every string with every other string. The comparison function basically takes two C-strings and returns an integer which is the number of positions where they differ.
This is taking really long on my 4-core machine. Last time I left it running for 45 minutes and it never finished executing. Please advise a faster solution or some optimizations.
Example:
000010
000001
have a difference of 2 since the last two bits do not match.
After I calculate the difference I store the value in another array:
int holder;
for (int x = 0; x < UsedTableSpace; x++) {
    int min = 10000000;
    for (int y = 0; y < UsedTableSpace; y++) {
        if (x != y) {
            // compr calculates the difference between two c-string arrays
            int tempDiff = compr(similarity[x]->matrix, similarity[y]->matrix);
            if (tempDiff < min) {
                min = tempDiff;
                holder = y;
            }
        }
    }
    similarity[holder]->inbound++;
}
With more information, we could probably give you better advice, but based on what I understand of the question, here are some ideas:
1. Since you're using each character to represent a 1 or a 0, you're using several times more memory than you need to, which creates a big performance impact when it comes to caching and such. Instead, represent your data using numeric values that you can think of as a series of bits.
2. Once you've implemented #1, you can grab an entire integer or long at a time and do a bitwise XOR operation to end up with a number that has a 1 in every place where the two numbers didn't have the same values. Then you can use some of the tricks mentioned here to count these bits speedily (see the sketch after this list).
3. Work on "unrolling" your loops somewhat to reduce the number of jumps necessary. For example, the following code:
   total = total + array[i];
   total = total + array[i + 1];
   total = total + array[i + 2];
   ...will work faster than just looping over total = total + array[i] three times. Jumps are expensive and interfere with the processor's pipelining. Update: I should mention that your compiler may be doing some of this for you already -- you can check the compiled code to see.
4. Break your overall data set into chunks that will allow you to take full advantage of caching. Think of your problem as a "square" with the i index on one axis and the j index on the other. If you start with one i and iterate across all 175,000 j values, the first j values you visit will be gone from the cache by the time you get to the end of the line. On the other hand, if you take the top left corner and go from j=0 to 256, most of the values on the j axis will still be in a low-level cache as you loop around to compare them with i=0, 1, 2, etc.
5. Lastly, although this should go without saying, I guess it's worth mentioning: make sure your compiler is set to optimize!
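Here is the sketch promised in point 2. It assumes you first repack each 100-character row into fixed-width bit blocks; the names are placeholders, not your existing functions:
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <string>

// Pack a row of '0'/'1' characters into two 64-bit blocks.
void pack(const std::string& row, std::uint64_t block[2])
{
    block[0] = block[1] = 0;
    for (std::size_t i = 0; i < row.size() && i < 128; ++i)
        if (row[i] == '1')
            block[i / 64] |= std::uint64_t(1) << (i % 64);
}

// Hamming distance between two packed rows: XOR, then count the set bits.
int distance(const std::uint64_t a[2], const std::uint64_t b[2])
{
    return int(std::bitset<64>(a[0] ^ b[0]).count()
             + std::bitset<64>(a[1] ^ b[1]).count());
}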
One simple optimization is to compare the strings only once. If the difference between A and B is 12, the difference between B and A is also 12. Your running time is going to drop by almost half.
In code:
int compr(const char* a, const char* b) {
    int d = 0, i;
    for (i = 0; i < 100; ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

void main_function(...) {
    int holder = 0;   // declared here as well, as in your original code
    for (int x = 0; x < UsedTableSpace; x++) {
        int min = 10000000;
        for (int y = x + 1; y < UsedTableSpace; y++) {
            // compr calculates the difference between two c-string arrays
            int tempDiff = compr(similarity[x]->matrix, similarity[y]->matrix);
            if (tempDiff < min) {
                min = tempDiff;
                holder = y;
            }
        }
        similarity[holder]->inbound++;
    }
}
Notice the second for loop: I've changed the start index.
Another optimization is running the comparison on separate threads to take advantage of your 4 cores.
What is your goal, i.e. what do you want to do with the Hamming Distances (which is what they are) after you've got them? For example, if you are looking for the closest pair, or most distant pair, you probably can get an O(n ln n) algorithm instead of the O(n^2) methods suggested so far. (At n=175000, n^2 is 15000 times larger than n ln n.)
For example, you could characterize each 100-bit number m by 8 4-bit numbers, being the number of bits set in 8 segments of m, and sort the resulting 32-bit signatures into ascending order. Signatures of the closest pair are likely to be nearby in the sorted list. It is easy to lower-bound the distance between two numbers if their signatures differ, giving an effective branch-and-bound process as less-distant numbers are found.
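A sketch of such a signature, assuming the rows are still the '0'/'1' strings from the question (the segment boundaries here are just one possible choice):
#include <cstddef>
#include <cstdint>
#include <string>

// 8 segments of roughly 13 bits each; each segment contributes a 4-bit popcount.
std::uint32_t signature(const std::string& row)
{
    const std::size_t seg = (row.size() + 7) / 8;   // segment length (13 for 100 bits)
    std::uint32_t sig = 0;
    for (std::size_t s = 0; s < 8; ++s) {
        std::uint32_t count = 0;
        for (std::size_t i = s * seg; i < (s + 1) * seg && i < row.size(); ++i)
            if (row[i] == '1')
                ++count;
        sig = (sig << 4) | (count & 0xF);           // counts up to 13 fit in 4 bits
    }
    return sig;
}
Sorting the rows by these signatures then tends to place likely-close pairs near each other, as described above.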

How can I remove the leading zeroes from an integer generated by a loop and store it as an array?

I have a for loop generating integers.
For instance:
for (int i = 300; i > 200; i--)
{
    n = (somefunction) * i;
    cout << n;
}
This produces an output on the screen like this:
f=00000000000100023;
I want to store the 100023 part of this number (i.e. ignore all the zeros before the non-zero digits start, but keep the zeros which follow) as an array.
Like this:
array[0]=1;
array[1]=0;
array[2]=0;
array[3]=0;
array[4]=2;
array[5]=3;
How would I go about achieving this?
This is a mish-mash of answers, because they are all there, I just don't think you're seeing the solution.
First off, if they are integers Bill's answer along with the other answers are great, save some of them skip out on the "store in array" part. Also, as pointed out in a comment on your question, this part is a duplicate.
But with your new code, the solution I had in mind was John's solution. You just need to figure out how to ignore leading zeros, which is easy:
std::vector<int> digits;
bool inNumber = false;

for (int i = 300; i > 200; i--)
{
    int value = (somefunction) * i;
    if (value != 0)
    {
        inNumber = true; // it's not zero, so we have entered the number
    }
    if (inNumber)
    {
        // this code cannot execute until we hit the first non-zero number
        digits.push_back(value);
    }
}
Basically, just don't start pushing until you've reached the actual number.
In light of the edited question, my original answer (below) isn't the best. If you absolutely have to have the output in an array instead of a vector, you can start with GMan's answer then transfer the resulting bytes to an array. You could do the same with JohnFx's answer once you find the first non-zero digit in his result.
I'm assuming f is of type int, in which case it doesn't store the leading zeroes.
int f = 100023;
To start you need to find the required length of the array. You can do that by taking the log (base 10) of f and adding 1 (100023 has 6 digits, while floor(log10(100023)) is 5). You can include the <cmath> header to use the log10 function.
int length = log10(f) + 1;
int array[length];
length should now be 6.
Next you can strip each digit from f and store it in the array using a loop and the modulus (%) operator.
for (int i = length - 1; i >= 0; --i)
{
    array[i] = f % 10;
    f = f / 10;
}
Each time through the loop, the modulus takes the last digit by returning the remainder from division by 10. The next line divides f by 10 to get ready for the next iteration of the loop.
The straightforward way would be
std::vector<int> vec;
while (MyInt > 0)
{
    vec.push_back(MyInt % 10);
    MyInt /= 10;
}
which stores the decimal digits in reverse order (a vector is used to simplify my code).
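If you need the most significant digit first, a call to std::reverse afterwards (from <algorithm>) puts the vector into the usual order:
std::reverse(vec.begin(), vec.end());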
Hang on a second. If you wrote the code generating the integers, why bother parsing it back into an array?
Why not just jam the integers into an array in your loop?
int array[100];
for (int i = 300; i > 200; i--)
{
    array[300 - i] = (somefunction) * i;   // i runs from 300 down to 201, so 300 - i indexes 0..99
}
The leading zeros are not kept in an int anyway, because with or without them it represents the same number.
See: convert an integer number into an array