I understand that shift operators perform multiplication and division on a number. So when I have 80 >> 3, I read it as dividing 80 by two, three times. And when I see 80 << 3, I read it as multiplying 80 by two, three times. But apparently I don't understand it correctly, because 9 << 99 gives 72, whereas I expected 9 multiplied by two, 99 times, i.e. 9 * (2 ^ 99). That's not what happens. I've read different articles, but I still don't understand.
The thing to remember about bit-shifting is, your values only have a finite number of bits, and (for primitive value-types at least) that number of bits is going to be less than 99.
In fact, if you compile your code with warnings enabled, you'll probably see a warning like this:
`warning: shift count >= width of type [-Wshift-count-overflow]`
... that is the compiler telling you that what you're trying to do invokes undefined behavior. Most likely the value you are shifting is either 32 bits or 64 bits long (depending on your code and/or the computer you are compiling for), so shifting left by more bits than that isn't a valid thing to do.
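For what it's worth, here is a minimal sketch of why you may have seen 72 (this is an assumption about your platform: on x86, the hardware masks a 32-bit shift count to its low 5 bits, so 99 & 31 == 3 and the shift behaves like 9 << 3 - but the 99-bit shift is still undefined behaviour and must not be relied on):

#include <cstdint>
#include <iostream>

int main()
{
    std::uint32_t x = 9;
    std::cout << (x << 3) << '\n';      // 72: the well-defined shift
    // std::cout << (x << 99) << '\n';  // undefined behaviour; may also print 72 on x86
}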
I am confused about how to calculate the bit-reflected constants in the white paper "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction".
In the posts Fast CRC with PCLMULQDQ NOT reflected and How the bit-reflect constant is calculated when we use CLMUL in CRC32, @rcgldr mentioned that "...are adjusted to compensate for the shift, so instead of x^(a) mod poly, it's (x^(a-32) mod poly)<<32...", but I do not understand what this means.
For example, the constant k1 = (x^(4*128+64) % P(x)) = 0x8833794c (on page 16) vs. k1' = (x^(4*128+64-32) % P(x) << 32)' = (0x154442db4 >> 1) (on page 22). I can't see any reflection relationship between those two figures (10001000_00110011_01111001_01001100 vs. 10101010_00100010_00010110_11011010).
I guess my question is: why does the exponent need to be reduced by 32 to compensate for the 32-bit left shift, and why are k1 and (k1)' not reflections of each other?
Could you please help me interpret it? Thanks.
I have searched carefully for the answer to this question on the internet, especially on Stack Overflow, and I tried to understand the related posts, but I need some experts to explain it in more detail.
I modified what were originally some Intel examples to work with Visual Studio on Windows, non-reflected and reflected, for 16, 32, and 64 bit CRC, in this github repository.
https://github.com/jeffareid/crc
I added some missing comments and also added a program to generate the constants used in the assembly code for each of the 6 cases.
instead of x^(a) mod poly, it's (x^(a-32) mod poly)<<32
This is done for non-reflected CRC. The CRC is kept in the upper 32 bits, as well as the constants, so that the result of PCLMULQDQ ends up in the upper 64 bits, and then right shifted. Shifting the constants left 32 bits is the same as multiplying by 2^32, or with polynomial notation, x^32.
For reflected CRC, the CRC is kept in the lower 32 bits, which are logically the upper 32 bits of a reflected number. The issue is that PCLMULQDQ effectively multiplies the reflected product by 2: the 127-bit product ends up in bits 126 to 0, with bit 127 == 0, which for a reflected number is a right shift by 1 bit. To compensate for that, the constants are (x^(a) mod poly) << 1 (a left shift of a reflected number is a divide by 2).
The example code at that github site includes crc32rg.cpp, which is the program to generate the constants used by crc32ra.asm.
Another issue occurs when doing 64-bit CRC. For non-reflected CRC, the constant is sometimes 65 bits (for example, if the divisor is 7), but only the lower 64 bits are stored, and the 2^64 bit is handled with a few more instructions. For reflected 64-bit CRC, since the constants can't be shifted left, (x^(a-1) mod poly) is used instead.
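As a rough sketch of where a constant like k1 comes from (this is my own illustration, not code from the repository or the paper): x^a mod P(x) can be computed with plain shift-and-XOR arithmetic over GF(2), where P(x) is the 33-bit CRC-32 polynomial 0x104C11DB7. The exponents and the 0x8833794c value are the ones quoted in the question above.

#include <cstdint>
#include <cstdio>

// Compute x^n mod P(x) over GF(2). The remainder always has degree < 32,
// so it fits in a uint32_t; poly_low holds the low 32 bits of P(x).
static std::uint32_t xpow_mod(unsigned n, std::uint32_t poly_low = 0x04C11DB7u)
{
    std::uint32_t r = 1;                        // represents x^0
    for (unsigned i = 0; i < n; ++i) {
        bool carry = (r & 0x80000000u) != 0;    // degree-31 term about to reach degree 32
        r <<= 1;                                // multiply by x
        if (carry)
            r ^= poly_low;                      // reduce: x^32 == poly_low (mod P)
    }
    return r;
}

int main()
{
    // k1 = x^(4*128+64) mod P(x), quoted above as 0x8833794c.
    std::printf("x^576 mod P = 0x%08x\n", xpow_mod(4 * 128 + 64));
    // The adjusted exponent used when the constant is stored shifted left 32 bits.
    std::printf("x^544 mod P = 0x%08x\n", xpow_mod(4 * 128 + 64 - 32));
}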
@rcgldr I don't think I caught your point, to be honest... probably I didn't make my question clear...
If my understanding of the code (reverse CRC32) is correct then, taking the simplest scenario as an example, the procedure for folding one 32-byte block is shown here. I don't understand why the exponents used in the constants are not 128 and 192 (= 128 + 64) respectively.
I'm using Java for this.
I have the code 97, which represents the 'a' character in ASCII. I convert 97 to binary, which gives me 1100001 (7 bits). I want to convert this to 12 bits. I can add leading 0's to the existing 7 bits until it reaches 12 bits, but this seems inefficient. I've been thinking of using the & bitwise operator to zero all but the lowest bits of 97 to reach 12 bits. Is this possible, and how can I do it?
byte buffer = (byte) (code & 0xff);
The above line of code will give me 01100001, no?
which gives me 1100001 (7 bits)
Your value buffer is 8 bits. Because that's what a byte is: 8 bits.
If code has type int (a detail added in a comment below), it is already a 32-bit number with, in this case, 25 leading zero bits. You need do nothing with it. It's got all the bits you're asking for.
There is no Java integral type with 12 bits, nor is one directly achievable, since 12 is not a multiple of the byte size. It's unclear why you want exactly 12 bits. What harm do you think an extra 20 zero bits will do?
The important fact is that in Java, integral types (char, byte, int, etc.) have a fixed number of bits, defined by the language specification.
With reference to your original code & 0xff - code has 32 bits. In general these bits could have any value.
In your particular case, you told us that code was 97, and therefore we know the top 25 bits of code were zero; this follows from the binary representation of 97.
Again in general, & 0xff would set all but the low 8 bits to zero. In your case, that had no actual effect because they were already zero. No bits are "added" - they are always there.
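To make that concrete, here is a minimal sketch (written in C++, but the & operator and hexadecimal literals behave the same way on a Java int): masking with 0xFFF keeps only the low 12 bits, and for 97 it changes nothing, because the upper bits are already zero.

#include <iostream>

int main()
{
    int code  = 97;             // 'a' in ASCII, binary 1100001
    int low12 = code & 0xFFF;   // keep only the low 12 bits; a no-op here,
                                // because the upper bits of 97 are already zero
    std::cout << low12 << '\n'; // prints 97
}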
Example of question:
Is calculating 123 * 456 faster than calculating 123456 * 7890? Or is it the same speed?
I'm wondering about 32-bit unsigned integers, but I won't ignore answers about other types (64-bit, signed, float, etc.). If it is different, what is the difference due to? Does it depend on whether the bits are 0 or 1?
Edit: If it makes a difference, I should clarify that I'm referring to any number (two random numbers lower than 100 vs two random numbers higher than 1000)
For built-in types up to at least the architecture's word size (e.g. 64 bit on a modern PC, 32 or 16 bit on most low-cost general-purpose CPUs from the last couple of decades), for every compiler/implementation/version and CPU I've ever heard of, the CPU opcode for multiplication of a particular integral size takes a fixed number of clock cycles irrespective of the quantities involved. Multiplications of data with different sizes perform differently on some CPUs (e.g. the AMD K7 has 3 cycles latency for 16-bit IMUL vs. 4 for 32-bit).
It is possible that on some architecture and compiler/flags combination, a type like long long int has more bits than the CPU opcodes can operate on in one instruction, so the compiler may emit code to do the multiplication in stages and that will be slower than multiplication of CPU-supported types. But again, a small value stored at run-time in a wider type is unlikely to be treated - or perform - any differently than a larger value.
All that said, if one or both values are compile-time constants, the compiler is able to avoid the CPU multiplication instruction and optimise to addition or bit-shifting operations for certain values (e.g. multiplying by 1 is obviously a no-op, a 0 on either side gives a 0 result, and * 4 can sometimes be implemented as << 2). There's nothing in particular stopping techniques like bit shifting being used for larger numbers, but a smaller percentage of such numbers can be optimised to the same degree (e.g. there are more powers of two - for which multiplication can be performed using a left bit shift - between 0 and 1000 than between 1000 and 2000).
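A small sketch of that strength-reduction point (the function names are mine, and the exact output depends on your compiler and flags): with optimisation enabled, e.g. -O2, most compilers emit the same shift or lea-style code for both of these functions, which you can check by inspecting the generated assembly.

#include <cstdint>

std::uint32_t times_four_mul(std::uint32_t x)   { return x * 4; }  // written as a multiply
std::uint32_t times_four_shift(std::uint32_t x) { return x << 2; } // written as a shift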
This is highly dependent on the processor architecture and model.
In the old days (ca 1980-1990), the number of ones in the two numbers would be a factor - the more ones, the longer it took to multiply [after sign adjustment, so multiplying by -1 wasn't slower than multiplying by 1, but multiplying by 32767 (15 ones) was notably slower than multiplying by 17 (2 ones)]. That's because a multiply is essentially:
#include <climits>   // for CHAR_BIT

unsigned int multiply(unsigned int a, unsigned int b)
{
    unsigned int res = 0;
    // One iteration per bit of the operand width (e.g. 32 for a 32-bit unsigned int).
    for (unsigned int i = 0; i < sizeof(unsigned int) * CHAR_BIT; i++)
    {
        if (b & 1)       // this bit of b is set, so add the shifted multiplicand
        {
            res += a;
        }
        a <<= 1;         // shift the multiplicand up one place
        b >>= 1;         // move to the next bit of b
    }
    return res;
}
In modern processors, multiply is quite fast either way, but a 64-bit multiply can be a clock cycle or two slower than a 32-bit one. That's simply because modern processors can "afford" to put down the whole logic for doing this in a single cycle - both in terms of the speed of the transistors themselves and the area that those transistors take up.
Further, in the old days, there were often instructions to do 16 x 16 -> 32 bit multiplies, but if you wanted 32 x 32 -> 32 (or 64), the compiler would have to call a library function [or inline such a function]. Today, I'm not aware of any modern high-end processor [x86, ARM, PowerPC] that can't do at least 64 x 64 -> 64, and some do 64 x 64 -> 128, all in a single instruction (not always a single cycle, though).
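For the 64 x 64 -> 128 case, here is a minimal sketch assuming a compiler that supports the unsigned __int128 extension (GCC and Clang on 64-bit targets); the compiler typically lowers this to a single widening multiply instruction plus a move or two.

#include <cstdint>

// Multiply two 64-bit values and return the full 128-bit product in two halves.
void mul_64x64_to_128(std::uint64_t a, std::uint64_t b,
                      std::uint64_t& hi, std::uint64_t& lo)
{
    unsigned __int128 product = static_cast<unsigned __int128>(a) * b;
    lo = static_cast<std::uint64_t>(product);
    hi = static_cast<std::uint64_t>(product >> 64);
}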
Note that I'm completely ignoring the fact that whether the data is in cache is an important factor. Yes, it is a factor - and it's a bit like ignoring wind resistance when traveling at 200 km/h - it's not something you ignore in the real world. However, it is quite unimportant for THIS discussion. Just like people making sports cars care about aerodynamics, getting complex [or simple] software to run fast involves a certain amount of caring about the cache content.
For all intents and purposes, the same speed (even if there were differences in computation speed, they would be immeasurable). Here is a reference benchmarking different CPU operations if you're curious: http://www.agner.org/optimize/instruction_tables.pdf.
Suppose I have the following code to loop over numbers as follows:
int p;
cin >> p;
for (unsigned long long int i = 3 * pow(10, p); i < 6 * pow(10, p); i++) {
    // some code goes here
}
Now, based on certain condition checks, I need to print i for values in the range 3*pow(10,p) <= i < 6*pow(10,p).
The code works fine up to p = 8, then it becomes pretty sluggish and the compiler seems to get stuck for p = 9, 10, 11 and onwards.
I am guessing the problem lies in using the correct data type. What should be the correct data type to be used here ?
The purpose of this loop is to find the decent numbers in the range. The conditions for a decent number are as follows:
1) It contains 3, 5, or both as its digits. No other digit is allowed.
2) The number of times 3 appears is divisible by 5.
3) The number of times 5 appears is divisible by 3.
NOTE: I used unsigned long long int here (0 to 18,446,744,073,709,551,615) . I am running on a 32-bit machine.
You could use <cstdint> and its int64_t (which is guaranteed to have 64 bits), and you should compute the power outside of the loop; also, long long has at least 64 bits in recent C and C++ standards.
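A minimal sketch of that suggestion, keeping the shape of the loop from the question (the variable names are mine):

#include <iostream>

int main()
{
    int p;
    std::cin >> p;

    // Compute 10^p once, with exact integer arithmetic, instead of calling
    // pow() (which returns a double) in the loop condition on every iteration.
    unsigned long long power = 1;
    for (int i = 0; i < p; ++i)
        power *= 10;

    const unsigned long long low  = 3 * power;
    const unsigned long long high = 6 * power;
    for (unsigned long long i = low; i < high; ++i) {
        // some code goes here
    }
}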
But, as mentioned in a comment by 1201ProgramAlarm, 3e11 (i.e. 300 billion) iterations is a lot, even on our fast machines. It could take minutes or hours: an elementary operation takes about a nanosecond (or half of one), so 3e9 operations need several seconds and 3e11 operations need several minutes. And your loop body could do several thousand (or even more) elementary operations (i.e. machine code instructions).
It is not the compiler which is stuck: compiling your code is easy and quick (as long as the program has a reasonable size, e.g. less than ten thousand lines of code, without weird preprocessor or template expansion tricks expanding them pathologically). It is the computer running the compiled executable.
If you benchmark your code, don't forget to enable optimizations in your compiler (e.g. compile with g++ -Wall -O2 -march=native if using GCC...)
You should think a lot more on your problem and reformulate it to have a smaller search space.
Actually, your decent numbers might better be thought of as the strings of digits representing them; after all, a number does not have digits (in particular, a number expressed in binary or ternary notation cannot have 3 as a digit); only some representation of a number has digits.
Then you should only consider the strings of 3s and 5s which are shorter than 12 characters, and there are far fewer of them (fewer than 10000, and in fact fewer than 2^13, i.e. 8192); iterating ten thousand times is quick. So generate every string shorter than e.g. 15 characters containing only 3 and 5, and test whether it is decent.
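Here is a minimal sketch of that string-based approach (the helper names and the length limit of 12 are my own choices): generate every string of '3' and '5' up to the limit and keep those whose digit counts make the number decent.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>

// Returns true when the digit counts satisfy the "decent" conditions.
static bool is_decent(const std::string& s)
{
    const auto threes = std::count(s.begin(), s.end(), '3');
    const auto fives  = std::count(s.begin(), s.end(), '5');
    return threes % 5 == 0 && fives % 3 == 0;
}

// Recursively builds every string of '3' and '5' up to max_len characters,
// printing the decent ones.
static void generate(std::string& s, std::size_t max_len)
{
    if (!s.empty() && is_decent(s))
        std::cout << s << '\n';
    if (s.size() == max_len)
        return;
    const char digits[] = {'3', '5'};
    for (char digit : digits) {
        s.push_back(digit);
        generate(s, max_len);
        s.pop_back();
    }
}

int main()
{
    std::string s;
    generate(s, 12);   // at most 2^13 - 2 candidate strings, enumerated almost instantly
}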
This question already has answers here:
Why is unsigned integer overflow defined behavior but signed integer overflow isn't?
I just simply wanted to know: who is responsible for dealing with mathematical overflow cases in a computer?
For example, in the following C++ code:
short x = 32768;
std::cout << x;
Compiling and running this code on my machine gave me a result of -32767
A "short" variable's size is 2 bytes .. and we know 2 bytes can hold a maximum decimal value of 32767 (if signed) .. so when I assigned 32768 to x .. after exceeding its max value 32767 .. It started counting from -32767 all over again to 32767 and so on ..
What exactly happened so that the value -32767 was given in this case?
I.e. what are the binary calculations done in the background that resulted in this value?
So, who decided that this happens? I mean, who is responsible for deciding that when a mathematical overflow happens in my program, the value of the variable simply starts again from its minimum value, or an exception is thrown, or the program simply freezes, etc.?
Is it the language standard, the compiler, my OS, my CPU, or who is it ?
And how does it deal with that overflow situation? (A simple explanation, or a link explaining it in detail, would be appreciated. :) )
And by the way, who decides what the size of a 'short int', for example, is on my machine? Is that also the language standard, compiler, OS, CPU, etc.?
Thanks in advance! :)
Edit:
OK, so I understood from here: Why is unsigned integer overflow defined behavior but signed integer overflow isn't?
that it's the processor that defines what happens in an overflow situation (like, for example, on my machine it started from -32767 all over again), depending on the processor's representation of signed values, i.e. whether it is sign-magnitude, one's complement or two's complement...
Is that right?
And in my case (where the result looked like it started from the min value -32767 again), how do you suppose my CPU is representing the signed values, and how did the value -32767, for example, come up? (Again, the binary calculations that lead to this, please. :) )
It doesn't start at its min value per se. It just truncates the value, so for a 4-bit number, you can count up to 1111 (binary, = 15 decimal). If you increment by one, you get 10000, but there is no room for that, so the first digit is dropped and 0000 remains. If you calculated 1111 + 10, you'd get 1.
You can add them up as you would on paper:
1111
0010
---- +
10001
But instead of adding up the entire number, the processor will just add up until it reaches (in this case) 4 bits. After that, there is no more room to add, but if there is still a 1 to carry, it sets the carry/overflow flag, so you can check whether the last addition it did overflowed.
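Here is a tiny sketch of that 4-bit example (emulating the 4-bit register by masking with 0xF, since C++ has no 4-bit integer type):

#include <iostream>

int main()
{
    unsigned value = 0b1111;                    // 15, the largest 4-bit value
    unsigned sum   = (value + 0b0010) & 0xF;    // 15 + 2 = 17, but only 4 bits fit
    std::cout << sum << '\n';                   // prints 1, the truncated result
}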
Processors have basic instructions to add up numbers, for both smaller and larger values. A 64-bit processor can add up 64-bit numbers (usually they don't add up two numbers into a third, but add a second number to the first, modifying the first, but that's not really important for the story).
But apart from 64 bits, they often can also add up 32, 16 and 8 bit numbers. That's partly because it can be efficient to add up only 8 bits if you don't need more, but also sometimes to be backwards compatible with older programs for a previous version of a processor which could add up to 32 bits but not 64 bits.
Such a program uses an instruction to add up 32-bit numbers, and the same instruction must also exist on the 64-bit processor, with the same behavior if there is an overflow; otherwise the program wouldn't be able to run properly on the newer processor.
Apart from adding up using the core instructions of the processor, you could also add up in software. You could make an inc function that treats a big chunk of bits as a single value. To increment it, you let the processor increment the first 64 bits; the result is stored in the first part of your chunk. If the overflow flag is set in the processor, you take the next 64 bits and increment those too. This way, you can go beyond the limitation of the processor and handle larger numbers in software.
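A rough sketch of that software-extension idea (my own illustration; real code would read the processor's carry flag, which C++ does not expose directly, so this version detects the wrap-around instead):

#include <cstdint>
#include <vector>

// A software "big counter": a vector of 64-bit words treated as one
// little-endian number, incremented with manual carry propagation.
void increment(std::vector<std::uint64_t>& words)
{
    for (std::uint64_t& w : words) {
        if (++w != 0)        // no wrap-around, so no carry to propagate
            return;
        // w wrapped from 0xFFFF...FF back to 0: carry into the next word
    }
    words.push_back(1);      // carried out of the top word: grow the number
}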
And the same goes for the way an overflow is handled. The processor just sets the flag. Your application can decide whether to act on it or not. If you want a counter that just increments to 65535 and then wraps to 0, you (your program) don't need to do anything with the flag.