Subtracting .size() function values of different strings

Subtracting .size() function values of different strings - c++

Recently I encountered a problem while I was trying to subtract .size() values of two strings in c++. As far as I know, size() returns number of characters in a string. So lets say I have 2 strings p and q, abs(p.size()-q.size()) should return me difference in length of both strings. But when I ran this code, it returned an abruptly large value. When I individually print the length of both or if I store their length values in different integers and subtract them, they give me correct answer. Am not yet able to figure out why.

size() returns an unsigned value. A smaller unsigned value minus a larger one is then underflowing the calculation, resulting in a large negative value. Think of it as if you have the "rolling" counter of miles or km in a car, and you roll back past 0, it becomes 99999, which is a big number.
The solution, assuming you care about negative differences is to do static_cast<int>(p.size() - q.size()) (and pass that to abs).

Return Value of size() is the number of size_t (an unsigned integral type)
So if you subtract greater number from smaller number, you'll get into problem and get that big value as a result of subtraction.
Reference std::string::size

std::string member function size() returns an unsigned value, so if p.size() < q.size(), the expression p.size()-q.size() will not evaluate to a negative number (it's unsigned, cannot be negative) but to a (often) very very big (unsigned) number.

std::strings reports their size as some width of unsigned integer; such types are a bit like the second hand on a watch: you can wind it forward from 0 up to 59 but if you keep going clockwise it drops to 0 before incrementing again, while if you wind counterclockwise you count down to 0 then jump to 59 and count down from there, ad infinitum.
Say you are subtracting a string length of 6 from a string length of 4, it's much like saying "start the minute hand at 4 and wind counterclockwise by 6 minutes" - when you've wound back 4 minutes the second hand's already at 0, and you wind another minute to get to 59, and the final minute brings you to 58. For std::string::size_type the maximum isn't 59 - it's much larger - but the problem's the same. The result is always positive so is unaffected by abs, but regardless - not what you wanted!
The actual maximum value can be accessed after #include <limits> with std::numeric_limits<std::string::size_type>::max(), for whatever that's worth.
There are many ways to solve this problem. David Schwartz's comment on Zola's answer lists one good one: std::max(p.size(),q.size())-std::min(p.size(),q.size()), which you can think of as "subtract the smaller value from the larger value". Another option is...
p.size() > q.size() ? p.size() - q.size() : q.size() - p.size()
...which means "if p's larger, subtract q from it, otherwise subtract it (i.e. p) from q".

Related

how do i find collision for a simple hash algorithm

I have the following hash algorithm:
unsigned long specialNum=0x4E67C6A7;
unsigned int ch;
char inputVal[]=" AAPB2GXG";
for(int i=0;i<strlen(inputVal);i++)
{
ch=inputVal[i];
ch=ch+(specialNum*32);
ch=ch+(specialNum/4);
specialNum=bitXor(specialNum,ch);
}
unsigned int outputVal=specialNum;
The bitXor simply does the Xor operation:
int bitXor(int a,int b)
{
return (a & ~b) | (~a & b);
}
Now I want to find an Algorithm that can generate an "inputVal" when the outputVal is given.(The generated inputVal may not be necessarily be same as the original inputVal.That's why I want to find collision).
This means that I need to find an algorithm that generates a solution that when fed into the above algorithm results same as specified "outputVal".
The length of solution to be generated should be less than or equal to 32.

Method 1: Brute force. Not a big deal, because your "specialNum" is always in the range of an int, so after trying on average a few billion input values, you find the right one. Should be done in a few seconds.
Method 2: Brute force, but clever.
Consider the specialNum value before the last ch is processed. You first calculate (specialNum * 32) + (specialNum / 4) + ch. Since -128 <= ch < 128 or 0 <= ch < 256 depending on the signedness of char, you know the highest 23 bits of the result, independent of ch. After xor'ing ch with specialNum, you also know the highest 23 bits (if ch is signed, there are two possible values for the highest 23 bits). You check whether those 23 bits match the desired output, and if they don't, you have excluded all 256 values of ch in one go. So the brute force method will end on average after 16 million steps.
Now consider the specialNum value before the last two ch are processed. Again, you can determine the highest possible 14 bits of the result (if ch is signed with four alternatives) without examining the last two characters at all. If the highest 14 bits don't match, you are done.
Method 3: This is how you do it. Consider in turn all strings s of length 0, 1, 2, etc. (however, your algorithm will most likely find a solution much quicker). Calculate specialNum after processing the string s. Following your algorithm, and allowing for char to be signed, find the up to 4 different values that the highest 14 bits of specialNum might have after processing two further characters. If any of those matches the desired output, then examine the value of specialNum after processing each of the 256 possible values of the next character, and find the up to 2 different values that the highest 23 bits of specialNum might have after examining another char. If one of those matches the highest 23 bits of the desired output then examine what specialNum would be after processing each of the 256 possible next characters and look for a match.
This should work below a millisecond. If char is unsigned, it is faster.

Calculate and Store Power of very large Number

I am finding pow(2,i) where i can range: 0<=i<=100000.
Apart i have MOD=1000000007
powers[100000];
powers[0]=1;
for (i = 1; i <=100000; ++i)
{
powers[i]=(powers[i-1]*2)%MOD;
}
for i=100000 won't power value become greater than MOD ?
How do I store the power correctly?
The operation doesn't look feasible to me.
I am getting correct value up to i=70 max I guess.
I have to find sum+= ar[i]*power(2,i) and finally print sum%1000000007 where ar[i] is an additional array with some big numbers up to 10^5

As long as your modulus value is less than half the capacity of your data type, it will never be exceeded. That's because you take the previous value in the range 0..1000000006, double it, then re-modulo it bringing it back to that same range.
However, I can't guarantee that higher values won't cause you troubles, it's more mathematical analysis than I'm prepared to invest given the simple alternative. You could spend a lot of time analysing, checking and debugging, but it's probably better just to not allow the problem to occur in the first place.
The alternative? I'd tend to use the pre-generation method (having a program do the gruntwork up front, inserting the pre-generated values into an array easily and speedily accessible from your real program).
With this method, you can use tools that are well tested and known to work with massive values. Since this data is not going to change, it's useless calculating it every time your program starts.
If you want an easy (and efficient) way to do this, the following bash script in conjunction with bc and awk can do this:
#!/usr/bin/bash
bc >nums.txt <<EOF
i = 1;
for (x = 0;x <= 10000; x++) {
i % 1000000007;
i = i * 2;
}
EOF
awk 'BEGIN { printf "static int array[] = {" }
{ if (NR % 5 == 1) printf "\n ";
printf "%s, ",$0;
next
}
END { print "\n};" }' nums.txt
The bc part is the "meat" of the matter, it creates the large powers of two and outputs them modulo the number you provided. The awk part is simply to format them in C-style array elements, five per line.
Just take the output of that and put it into your code and, voila, there you have it, a compile-time-expensed array that you can use for fast lookup.
It takes only a second and a half on my box to generate the array and then you never need to do it again. You also won't have to concern yourself with the vagaries of modulo math :-)
static int array[] = {
1,2,4,8,16,
32,64,128,256,512,
1024,2048,4096,8192,16384,
32768,65536,131072,262144,524288,
1048576,2097152,4194304,8388608,16777216,
33554432,67108864,134217728,268435456,536870912,
73741817,147483634,294967268,589934536,179869065,
359738130,719476260,438952513,877905026,755810045,
511620083,23240159,46480318,92960636,185921272,
371842544,743685088,487370169,974740338,949480669,
898961331,797922655,595845303,191690599,383381198,
766762396,533524785,67049563,134099126,268198252,
536396504,72793001,145586002,291172004,582344008,
164688009,329376018,658752036,317504065,635008130,
270016253,540032506,80065005,160130010,320260020,
640520040,281040073,562080146,124160285,248320570,
:
861508356,723016705,446033403,892066806,784133605,
568267203,136534399,273068798,546137596,92275185,
184550370,369100740,738201480,476402953,952805906,
905611805,
};

If you notice that your modulo can be stored in int. MOD=1000000007(decimal) is equivalent of 0b00111011100110101100101000000111 and can be stored in 32 bits.
- i pow(2,i) bit representation
- 0 1 0b00000000000000000000000000000001
- 1 2 0b00000000000000000000000000000010
- 2 4 0b00000000000000000000000000000100
- 3 8 0b00000000000000000000000000001000
- ...
- 29 536870912 0b00100000000000000000000000000000
Tricky part starts when pow(2,i) is grater than your MOD=1000000007, but if you know that current pow(2,i) will be greater than your MOD, you can actually see how bits look like after MOD
- i pow(2,i) pow(2,i)%MOD bit representation
- 30 1073741824 73741817 0b000100011001010011000000000000
- 31 2147483648 147483634 0b001000110010100110000000000000
- 32 4294967296 294967268 0b010001100101001100000000000000
- 33 8589934592 589934536 0b100011001010011000000000000000
So if you have pow(2,i-1)%MOD you can do *2 actually on pow(2,i-1)%MOD till you're next pow(2,i) will be greater than MOD.
In example for i=34 you will use (589934536*2) MOD 1000000007 instead of (8589934592*2) MOD 1000000007, because 8589934592 can't be stored in int.
Additional you can try bit operations instead of multiplication for pow(2,i).
Bit operation same as multiplication for 2 is bit shift left.

Write a function to take 8 bits from an array and turn it into Decimal

I'm trying to write a function that takes 8 bits from an array that is 6x24 (just consider it taking a byte 1 bit at a time) and convert it to a decimal integer. Meaning there should be 18 numbers in total. Here is my code
int bitArray[6][24]; //the Array of bits, can only be a 1 or 0
int ex=0; //ex keeps track of the current exponent to use to calculate the decimal value of a binary digit
int decArray[18]; //array to store decimals
int byteToDecimal(int pos, int row) //takes two variables so you can give it an array column and row
{
numholder=0; //Temporary number for calculations
for(int x=pos; x<pos+8;x++) //pos is used to adjust where we start and stop looking at 1's and 0's in a row
{
if(bitArray[row][x] != 0)//if the row and column is a 1
{
numholder += pow(2, 7-ex);//2^(7-ex), meaning the first bit is worth 2^7, and the last is 2^0
}
ex++;
}
ex=0;
return numholder;
}
Then you can call the function like so
decArray[0]=byteToDecimal(0,0);
decArray[1]=byteToDecimal(8,0);
decArray[2]=byteToDecimal(16,0);
decArray[3]=byteToDecimal(0,1);
decArray[4]=byteToDecimal(8,1);
decArray[5]=byteToDecimal(16,1);
ect. When I place a single 1 into bitArray[0][0], calling the function gives me the number 127, when it should be 128.

Apparently bitArray (or at least the bytes involved) is not filled with zeros. The reason for that may vary. Most likely you have some leftover from previous operations with it. The second (insane) reason is that maybe Arduino C compiler doesn't initialize static objects with zeros (I've hever had experience with Arduino so I can't tell for sure).
In any case, try to call memset(bitArray, 0, sizeof(bitArray)) before you perform other operations with it.
Here is a demo written in plain C, demonstrating that normally your code should work fine.

modf() with BIG NUMBERS

I hope this finds you well.
I am trying to convert an index (number) for a word, using the ASCII code for that.
for ex:
index 0 -> " "
index 94 -> "~"
index 625798 -> "e#A"
index 899380 -> "!$^."
...
As we all can see, the 4th index correspond to a 4 char string. Unfortunately, at some point, these combinations get really big (i.e., for a word of 8 chars, i need to perform operations with 16 digit numbers (ex: 6634204312890625), and it gets really worse if I raise the number of chars of the word).
To support such big numbers, I had to upgrade some variables of my program from unsigned int to unsigned long long, but then I realized that modf() from C++ uses doubles and uint32_t (http://www.raspberryginger.com/jbailey/minix/html/modf_8c-source.html).
The question is: is this possible to adapt modf() to use 64 bit numbers like unsigned long long? I'm afraid that in case this is not possible, i'll be limited to digits of double length.
Can anyone enlight me please? =)

16-digit numbers fit within the range of a 64-bit number, so you should use uint64_t (from <stdint.h>). The % operator should then do what you need.
If you need bigger numbers, then you'll need to use a big-integer library. However, if all you're interested in is modulus, then there's a trick you can pull, based on the following properties of modulus:
mod(a * b) == mod(mod(a) * mod(b))
mod(a + b) == mod(mod(a) + mod(b))
As an example, let's express a 16-digit decimal number, x as:
x = x_hi * 1e8 + x_lo; // this is pseudocode, not real C
where x_hi is the 8 most-significant decimal digits, and x_lo the least-significant. The modulus of x can then be expressed as:
mod(x) = mod((mod(x_hi) * mod(1e8) + mod(x_lo));
where mod(1e8) is a constant which you can precalculate.
All of this can be done in integer arithmetic.

I could actually use a comment that was deleted right after (wonder why), that said:
modulus = a - a/b * b;
I've made a cast in the division to unsigned long long.
Now... I was a bit disappointed, because in my problem I thought I could keep raising the number of characters of the word with no problem. Nevertheless, I've started to get size issues at the n.º of chars = 7. Why? 95^7 starts to give huge numbers.
I was hoping to get the possibility to write a word like "my cat is so fat I 1234r5s" and calculate the index of this, but this word has almost 30 characters:
95^26 = 2635200944657423647039506726457895338535308837890625 combinations.
Anyway, thanks for the answer.

Analysis of the usage of prime numbers in hash functions

I was studying hash-based sort and I found that using prime numbers in a hash function is considered a good idea, because multiplying each character of the key by a prime number and adding the results up would produce a unique value (because primes are unique) and a prime number like 31 would produce better distribution of keys.
key(s)=s[0]*31(len–1)+s[1]*31(len–2)+ ... +s[len–1]
Sample code:
public int hashCode( )
{
int h = hash;
if (h == 0)
{
for (int i = 0; i < chars.length; i++)
{
h = MULT*h + chars[i];
}
hash = h;
}
return h;
}
I would like to understand why the use of even numbers for multiplying each character is a bad idea in the context of this explanation below (found on another forum; it sounds like a good explanation, but I'm failing to grasp it). If the reasoning below is not valid, I would appreciate a simpler explanation.
Suppose MULT were 26, and consider
hashing a hundred-character string.
How much influence does the string's
first character have on the final
value of 'h'? The first character's value
will have been multiplied by MULT 99
times, so if the arithmetic were done
in infinite precision the value would
consist of some jumble of bits
followed by 99 low-order zero bits --
each time you multiply by MULT you
introduce another low-order zero,
right? The computer's finite
arithmetic just chops away all the
excess high-order bits, so the first
character's actual contribution to 'h'
is ... precisely zero! The 'h' value
depends only on the rightmost 32
string characters (assuming a 32-bit
int), and even then things are not
wonderful: the first of those final 32
bytes influences only the leftmost bit
of `h' and has no effect on the
remaining 31. Clearly, an even-valued
MULT is a poor idea.

I think it's easier to see if you use 2 instead of 26. They both have the same effect on the lowest-order bit of h. Consider a 33 character string of some character c followed by 32 zero bytes (for illustrative purposes). Since the string isn't wholly null you'd hope the hash would be nonzero.
For the first character, your computed hash h is equal to c[0]. For the second character, you take h * 2 + c[1]. So now h is 2*c[0]. For the third character h is now h*2 + c[2] which works out to 4*c[0]. Repeat this 30 more times, and you can see that the multiplier uses more bits than are available in your destination, meaning effectively c[0] had no impact on the final hash at all.
The end math works out exactly the same with a different multiplier like 26, except that the intermediate hashes will modulo 2^32 every so often during the process. Since 26 is even it still adds one 0 bit to the low end each iteration.

This hash can be described like this (here ^ is exponentiation, not xor).
hash(string) = sum_over_i(s[i] * MULT^(strlen(s) - i - 1)) % (2^32).
Look at the contribution of the first character. It's
(s[0] * MULT^(strlen(s) - 1)) % (2^32).
If the string is long enough (strlen(s) > 32) then this is zero.

Other people have posted the answer -- if you use an even multiple, then only the last characters in the string matter for computing the hash, as the early character's influence will have shifted out of the register.
Now lets consider what happens when you use a multiplier like 31. Well, 31 is 32-1 or 2^5 - 1. So when you use that, your final hash value will be:
\sum{c_i 2^{5(len-i)} - \sum{c_i}
unfortunately stackoverflow doesn't understad TeX math notation, so the above is hard to understand, but its two summations over the characters in the string, where the first one shifts each character by 5 bits for each subsequent character in the string. So using a 32-bit machine, that will shift off the top for all except the last seven characters of the string.
The upshot of this is that using a multiplier of 31 means that while characters other than the last seven have an effect on the string, its completely independent of their order. If you take two strings that have the same last 7 characters, for which the other characters also the same but in a different order, you'll get the same hash for both. You'll also get the same hash for things like "az" and "by" other than in the last 7 chars.
So using a prime multiplier, while much better than an even multiplier, is still not very good. Better is to use a rotate instruction, which shifts the bits back into the bottom when they shift out the top. Something like:
public unisgned hashCode(string chars)
{
unsigned h = 0;
for (int i = 0; i < chars.length; i++) {
h = (h<<5) + (h>>27); // ROL by 5, assuming 32 bits here
h += chars[i];
}
return h;
}
Of course, this depends on your compiler being smart enough to recognize the idiom for a rotate instruction and turn it into a single instruction for maximum efficiency.
This also still has the problem that swapping 32-character blocks in the string will give the same hash value, so its far from strong, but probably adequate for most non-cryptographic purposes

would produce a unique value
Stop right there. Hashes are not unique. A good hash algorithm will minimize collisions, but the pigeonhole principle assures us that perfectly avoiding collisions is not possible (for any datatype with non-trivial information content).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js