This question already has answers here:
Incremental Checksums
(3 answers)
Closed 8 years ago.
If I have substrings S0, S1, ... Sn with calculated CRCs C0, C1, ... Cn, am I able to determine the CRC C0...n of concatenated input S0S1...Sn with any substantially greater efficiency than linearly processing the whole string?
Obviously, C0...n = CRC(S1...n, initialized with C0), but I'd like to know whether C0...n = f(C0,C1,...Cn) for some f() with O(n) complexity instead of O(|S0S1...Sn|).
Yes. You can see how in the implementation of crc32_combine() in zlib. It takes crc1, crc2, and len2, where len2 is the number of bytes in the block on which crc2 was calculated. It takes O(log(len2)) time. The combination can be repeated for following blocks.
The approach is to continue the CRC calculation on crc1 as if it were followed by len2 zero bytes, and then exclusive-or crc2 into the result. The len2 zero bytes are applied using an operator that is repeatedly squared and applied once for each 1 bit in len2, which is what gives the O(log(len2)) execution time. The routine was added to zlib in 2004.
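For illustration, here is a small usage sketch. crc32() and crc32_combine() are real zlib functions; the helper name and buffers are just for the example, and zlib must be linked:

#include <zlib.h>

unsigned long combined_crc(const unsigned char* s0, unsigned len0,
                           const unsigned char* s1, unsigned len1)
{
    unsigned long c0 = crc32(0L, s0, len0);   // C0 = CRC(S0)
    unsigned long c1 = crc32(0L, s1, len1);   // C1 = CRC(S1)
    // O(log(len1)) combine, versus O(len1) for re-feeding S1 with c0 as the initial value.
    return crc32_combine(c0, c1, len1);
}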
I have more than 1e7 sequences of tokens, where each token can only take one of four possible values.
In order to make this dataset fit into memory, I decided to encode each token in 2 bits, which allows me to store 4 tokens in a byte instead of just one (when using a char for each token / std::string for a sequence). I store each sequence in a char array.
For some algorithm, I need to test arbitrary subsequences of two token sequences for exact equality. Each subsequence can have an arbitrary offset. The length is typically between 10 and 30 tokens (random) and is the same for the two subsequences.
My current method is to operate in chunks:
Copy up to 32 tokens (each having 2 bits) from each subsequence into a uint64_t. This is realized in a loop over the tokens that selects the correct char in the array and writes the bits into the correct position of the uint64_t.
Compare the two uint64_t. If they are not equal, return.
Repeat until all tokens in the subsequences have been processed.
#include <climits>
#include <cstdint>

using Block = char;
constexpr int BitsPerToken = 2;
constexpr int TokenPerBlock = sizeof(Block) * CHAR_BIT / BitsPerToken;

Block getTokenFromBlock(Block b, int nt) noexcept
{
    // Mask after the shift so a (possibly signed) char cannot leak extra bits.
    return (b >> (nt * BitsPerToken)) & ((1UL << BitsPerToken) - 1);
}

bool seqEqual(Block const* seqA, int startA, int endA,
              Block const* seqB, int startB, int endB) noexcept
{
    using CompareBlock = std::uint64_t;
    constexpr int TokenPerCompareBlock = sizeof(CompareBlock) * CHAR_BIT / BitsPerToken;

    const int len = endA - startA;
    int posA = startA;
    int posB = startB;
    CompareBlock curA = 0;
    CompareBlock curB = 0;
    for (int i = 0; i < len; ++i, ++posA, ++posB)
    {
        // Index within the 64-bit comparison block (0..31), not within a Block.
        const int cmpIdx = i % TokenPerCompareBlock;
        const int blockA = posA / TokenPerBlock;
        const int idxA = posA % TokenPerBlock;
        const int blockB = posB / TokenPerBlock;
        const int idxB = posB % TokenPerBlock;
        if (cmpIdx == 0)
        {
            if (curA != curB)
                return false;
            curA = 0;
            curB = 0;
        }
        // Widen before shifting: a char shifted by up to 62 bits would be UB.
        curA |= CompareBlock(getTokenFromBlock(seqA[blockA], idxA)) << (BitsPerToken * cmpIdx);
        curB |= CompareBlock(getTokenFromBlock(seqB[blockB], idxB)) << (BitsPerToken * cmpIdx);
    }
    return curA == curB;
}
I figured that this should be quite fast (comparing 32 tokens simultaneously), but it is more than two times slower than using an std::string (with each token stored in a char) and its operator==.
I have looked into std::memcmp, but cannot use it because the subsequence might start somewhere within a byte (at a multiple of 2 bits, though).
Another candidate would be boost::dynamic_bitset, which basically implements the same storage format. However, it does not provide equality tests for arbitrary sub-ranges.
How can I achieve fast equality tests using this compressed format?
First of all, this is the kind of computation where the target processor, RAM, compiler and compiler flags can drastically change the results. Unfortunately this critical information is not provided. Let's assume you use a fairly recent mainstream x86-64 processor, common DDR4-SDRAM, a relatively up-to-date compiler like Clang/GCC, and that optimizations are enabled (i.e. -O3 and possibly -march=native).
Clang and GCC use fast comparison functions for comparing strings: memcmp for GCC 12 and bcmp for Clang 15, respectively. Both functions are highly optimized on most platforms: they typically compare short strings in blocks of 8 bytes (uint64_t) and large strings using SIMD instructions.
Your optimization is good for reducing the memory footprint, but it introduces more computation, and there is a high chance the operation is already compute-bound when the input buffer is in the CPU cache. In addition, the computation is not SIMD-friendly because of the inner loop: the compiler will almost certainly not generate efficient code for the bit-wise operations, and scalar code is slow. In fact, scalar byte-per-byte computation is generally so slow that it is usually far from able to saturate the RAM bandwidth (at least the bandwidth achievable using only 1 core), as opposed to memcmp. For example, a Skylake/Coffee Lake processor at 4 GHz can only read 8 GiB/s from the L1 cache using scalar byte-per-byte code, while AVX2 SIMD code can read 256 GiB/s. For writes, the figures are halved: 4 GiB/s vs. 128 GiB/s. A 1-channel DDR4-SDRAM @ 3200 MHz can theoretically reach ~24 GiB/s, far more than a byte-per-byte scalar sequential code can use. The L3 cache has a much bigger bandwidth.
If you want fast code for large sequences, then you need to either help your compiler so it can use SIMD instructions (not so easy in this case), use non-portable SIMD intrinsics, or possibly use a relatively portable SIMD library to generate quite good SIMD code (though low-level platform-dependent intrinsics are more flexible/featureful).
I expect the main bottleneck to come from the "loop over the tokens that selects the correct char in the array and writes the bits into the correct position of the uint64_t". Indeed, this loop will likely generate a dependency chain of instructions (operating on the same uint64_t variable) that can neither be executed efficiently by the processor nor easily optimized by the compiler.
A typical solution would be to read blocks of 8 bytes (using memcpy to do it correctly, and hoping the compiler optimizes it properly). The bytes can be reordered using a bswap instruction on x86-64 processors; this is not needed on big-endian processors. A shift+mask can then be applied so as to compare only the useful part. Here is an (untested) example to show the idea:
if(length >= 16)
{
    uint64_t block1, block2;
    unsigned int shift1 = (start1 % 4) * 2;
    unsigned int shift2 = (start2 % 4) * 2;
    uint64_t mask = 0xFFFFFFFFFFFFFF00ull;

    // Read blocks 7 bytes at a time for the sake of simplicity
    for(size_t i=0; i<length-7; i+=7)
    {
        // Safe and cheap on GCC/Clang
        memcpy(&block1, &charArray1[i], 8);
        memcpy(&block2, &charArray2[i], 8);

        // Architecture-dependent: reorder bytes on little-endian processors.
        // There is a fast instruction for that on x86-64 processors: bswap.
        // See: https://stackoverflow.com/questions/36497605
        block1 = reorder_bytes(block1);
        block2 = reorder_bytes(block2);

        // Shift out the tokens before the start offset and mask off the
        // trailing byte, which the next 7-byte step will re-read.
        block1 = (block1 << shift1) & mask;
        block2 = (block2 << shift2) & mask;

        if(block1 != block2)
            return false;
    }
}
// TODO: compute the remainder part for the last block
This operation can be done using the SSE/AVX instruction sets so as to be faster for large sequences. Note that you can perform a special optimization when shift1 == shift2 (especially when both are equal to 0).
One should keep in mind that the bit-packing computation is pretty expensive, even using SIMD code. It will certainly not be faster than a plain memcmp-based comparison unless the operation is memory-bound, which is unlikely to be the case. For example, a Skylake/Coffee Lake processor can load and compare 2 blocks of 32 bytes (i.e. 32 tokens per block in the char-per-token format) in only 1 cycle (reciprocal throughput) using the AVX2 SIMD instruction set, while there is no chance an iteration of the above bit-packing loop takes less than 2 cycles to process 7 bytes (i.e. 28 tokens). Using AVX2 to optimize the above code is possible, but the AVX lanes and the byte reordering require several additional instructions, so it will certainly still be slightly slower than a basic very fast comparison (a few cycles to compare ~120 tokens).
The only use-case where packing can help is when multiple cores are used to do the computation. Indeed, in that case, the bit-packing code can scale well because it is likely compute-bound, while the string-based version will quickly be limited by the speed of the RAM, since it is likely memory-bound.
If there are only 10 million tokens total, that is 20 Mbit, or 2-3 MB. If you keep shifted versions of the sequence in different arrays, from 2-bit-shifted to 30-bit-shifted (assuming 4-byte comparisons at once; a 32-bit shift can be ignored, as it just means a different starting position), you can do a direct comparison (std::memcmp) with no shifting involved (fast) after selecting the right array with the modulo of the arbitrary offset. But this requires the token sequence to stay constant through many function calls (if not the lifetime of the program).
If these tokens are part of much bigger data, you can put a caching layer (one that caches fixed-length chunks and joins them to get the requested sub-sequences for A and B) just before the shifted initialization. Maybe LRU/LFU works fast enough, if the token access pattern is cache-friendly. If it's not cache-friendly, then perhaps just reaching the arrays could be the bottleneck, with or without shifting.
If you do the checking per byte instead of per 4 bytes, it requires only 4 arrays instead of 16 and shouldn't add too big a memory requirement on top of the caching; a sketch of this variant follows below.
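A minimal sketch of this byte-wise variant, assuming the packed layout from the question (4 tokens per byte, least-significant bits first); all names here are illustrative, not a fixed API:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Build the 4 shifted copies of a packed buffer: copy s shifts the token
// stream down by s tokens (2*s bits), so a subsequence starting at token
// offset p is byte-aligned in copy (p % 4), starting at byte p / 4.
std::vector<std::vector<std::uint8_t>> buildShifted(const std::uint8_t* packed,
                                                    std::size_t nBytes)
{
    std::vector<std::vector<std::uint8_t>> copies(4, std::vector<std::uint8_t>(nBytes));
    for (int s = 0; s < 4; ++s)
        for (std::size_t i = 0; i < nBytes; ++i)
        {
            const unsigned lo = packed[i];
            const unsigned hi = (i + 1 < nBytes) ? packed[i + 1] : 0u;
            copies[s][i] = static_cast<std::uint8_t>(((lo | (hi << 8)) >> (2 * s)) & 0xFFu);
        }
    return copies;
}

// Compare len tokens starting at token offsets pA and pB; assumes both
// subsequences lie fully inside their buffers.
bool equalTokens(const std::vector<std::vector<std::uint8_t>>& a, std::size_t pA,
                 const std::vector<std::vector<std::uint8_t>>& b, std::size_t pB,
                 std::size_t len)
{
    const std::uint8_t* pa = a[pA % 4].data() + pA / 4;
    const std::uint8_t* pb = b[pB % 4].data() + pB / 4;
    const std::size_t fullBytes = len / 4;
    if (std::memcmp(pa, pb, fullBytes) != 0)   // bulk of the work is a plain memcmp
        return false;
    const unsigned rest = len % 4;             // leftover tokens in the last byte
    if (rest == 0)
        return true;
    const std::uint8_t mask = static_cast<std::uint8_t>((1u << (2 * rest)) - 1);
    return (pa[fullBytes] & mask) == (pb[fullBytes] & mask);
}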
You can also store an XOR result of fixed-length (like 50-100) sub-sequences for every offset as a way of exiting more quickly. Again, this requires 4x more memory space. If the XOR results of the first tokens (+ fixed length) are not equal, then the subsequences are not equal. This would at least reduce the number of full comparisons.
Another way is to directly cache f(x,y) -> bool, like Python does with its own caching. But this would be much worse than the fixed-length-chunked caching (and joining the chunks) due to non-reusable parts and a lot of duplication.
I have the following hash algorithm:
unsigned long specialNum = 0x4E67C6A7;
unsigned int ch;
char inputVal[] = " AAPB2GXG";

for (int i = 0; i < strlen(inputVal); i++)
{
    ch = inputVal[i];
    ch = ch + (specialNum * 32);
    ch = ch + (specialNum / 4);
    specialNum = bitXor(specialNum, ch);
}
unsigned int outputVal = specialNum;
The bitXor simply does the Xor operation:
int bitXor(int a, int b)
{
    return (a & ~b) | (~a & b);
}
Now I want to find an algorithm that can generate an inputVal when the outputVal is given (the generated inputVal need not be the same as the original inputVal; that's why I want to find a collision).
This means that I need to find an algorithm that generates a solution that, when fed into the above algorithm, results in the specified outputVal.
The length of the solution to be generated should be less than or equal to 32.
Method 1: Brute force. Not a big deal, because your "specialNum" is always in the range of an int, so after trying on average a few billion input values, you find the right one. Should be done in a few seconds.
Method 2: Brute force, but clever.
Consider the specialNum value before the last ch is processed. You first calculate (specialNum * 32) + (specialNum / 4) + ch. Since -128 <= ch < 128 or 0 <= ch < 256 depending on the signedness of char, you know the highest 23 bits of the result, independent of ch. After xor'ing ch with specialNum, you also know the highest 23 bits (if ch is signed, there are two possible values for the highest 23 bits). You check whether those 23 bits match the desired output, and if they don't, you have excluded all 256 values of ch in one go. So the brute force method will end on average after 16 million steps.
Now consider the specialNum value before the last two ch are processed. Again, you can determine the highest 14 bits of the result (with four alternatives if char is signed) without examining the last two characters at all. If the highest 14 bits don't match, you are done.
Method 3: This is how you do it. Consider in turn all strings s of length 0, 1, 2, etc. (however, your algorithm will most likely find a solution much quicker). Calculate specialNum after processing the string s. Following your algorithm, and allowing for char to be signed, find the up to 4 different values that the highest 14 bits of specialNum might have after processing two further characters. If any of those matches the desired output, then examine the value of specialNum after processing each of the 256 possible values of the next character, and find the up to 2 different values that the highest 23 bits of specialNum might have after examining another char. If one of those matches the highest 23 bits of the desired output then examine what specialNum would be after processing each of the 256 possible next characters and look for a match.
This should work below a millisecond. If char is unsigned, it is faster.
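For reference, a minimal sketch of Method 1 (plain brute force). It assumes specialNum effectively behaves as a 32-bit value (outputVal is an unsigned int, and bitXor truncates through int; on LP64 platforms, where unsigned long is 64-bit, the /4 step can diverge from this). The function names are illustrative:

#include <cstdint>
#include <string>

// 32-bit re-implementation of the hash above (sketch).
std::uint32_t hashVal(const std::string& s)
{
    std::uint32_t specialNum = 0x4E67C6A7u;
    for (unsigned char c : s)
    {
        std::uint32_t ch = c;
        ch += specialNum * 32;
        ch += specialNum / 4;
        specialNum ^= ch;   // same as bitXor()
    }
    return specialNum;
}

// Enumerate printable strings of length 1..5 (95^5 ~ 7.7e9 candidates, so a
// 32-bit target is very likely to be hit) and return the first preimage found.
std::string bruteForce(std::uint32_t target)
{
    for (int len = 1; len <= 5; ++len)
    {
        std::uint64_t total = 1;
        for (int i = 0; i < len; ++i) total *= 95;
        std::string s(len, ' ');
        for (std::uint64_t idx = 0; idx < total; ++idx)
        {
            std::uint64_t v = idx;
            for (int i = 0; i < len; ++i) { s[i] = char(' ' + v % 95); v /= 95; }
            if (hashVal(s) == target)
                return s;
        }
    }
    return {};   // no printable preimage of length <= 5 found
}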
This question already has answers here:
n & (n-1) what does this expression do? [duplicate]
(4 answers)
Closed 6 years ago.
I need some explanation of how this specific line works.
I know that this function counts the number of 1 bits, but how exactly does this line clear the rightmost 1 bit?
int f(int n) {
    int c;
    for (c = 0; n != 0; ++c)
        n = n & (n - 1);
    return c;
}
Can someone explain it to me briefly or give some "proof"?
Any nonzero unsigned integer n will have the following last k bits: a one followed by (k-1) zeroes: 100...0.
Note that k can be 1, in which case there are no zeroes.
(n - 1) will end in this format: a zero followed by (k-1) ones: 011...1.
n & (n-1) will therefore end in k zeroes: 100...0 & 011...1 = 000...0.
Hence n & (n - 1) eliminates the rightmost '1'. Each iteration thus removes the rightmost '1' bit, which is why you can count the number of 1's.
I've been brushing up on bit manipulation and came across this. It may not be useful to the original poster now (3 years later), but I am going to answer anyway to improve the quality for other viewers.
What does it mean for n & (n-1) to equal zero?
We should make sure we understand that first, since it is the only way the loop condition (n != 0) can become false.
Let's say n=8. The bit representation for that would be 00001000. The bit representation for n-1 (or 7) would be 00000111. The & operator returns the bits set in both arguments. Since 00001000 and 00000111 do not have any similar bits set, the result would be 00000000 (or zero).
You may have caught on that the number 8 wasn't randomly chosen. It was an example where n is a power of 2. All powers of 2 (2, 4, 8, 16, etc.) will have the same result.
What happens when you pass something that is not a power of 2? For example, when n=6, the bit representation is 00000110, and n-1=5, or 00000101. The & applied to these two arguments leaves only the single bit they have in common: 00000100 (4). Now n=4, which is not zero, so we increment c and try the same process with n=4. As we've seen above, 4 is a power of 2, so it will end the loop at the next step. Each iteration cuts off the rightmost set bit until n reaches zero.
What is c?
It starts at 0 and is incremented once per loop iteration. Since each iteration clears exactly one set bit, c ends up being the number of 1 bits in the original n.
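To make the walk-through above concrete, here is a tiny standalone program (my own illustration) that prints each step for n = 6:

#include <bitset>
#include <iostream>

int main()
{
    int n = 6, c = 0;
    while (n != 0)
    {
        // Each step clears exactly the rightmost set bit.
        std::cout << std::bitset<8>(n) << " -> " << std::bitset<8>(n & (n - 1)) << '\n';
        n = n & (n - 1);
        ++c;
    }
    std::cout << "set bits: " << c << '\n';   // prints 2 for n = 6
}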
I hope this finds you well.
I am trying to convert an index (a number) into a word, using the ASCII codes for that.
for ex:
index 0 -> " "
index 94 -> "~"
index 625798 -> "e#A"
index 899380 -> "!$^."
...
As we can all see, the 4th example corresponds to a 4-character string. Unfortunately, at some point these combinations get really big (i.e., for a word of 8 chars, I need to perform operations on 16-digit numbers (e.g. 6634204312890625), and it gets much worse if I raise the number of chars in the word).
To support such big numbers, I had to upgrade some variables of my program from unsigned int to unsigned long long, but then I realized that modf() from C++ uses doubles and uint32_t (http://www.raspberryginger.com/jbailey/minix/html/modf_8c-source.html).
The question is: is it possible to adapt modf() to use 64-bit numbers like unsigned long long? I'm afraid that if this is not possible, I'll be limited to numbers that fit in a double.
Can anyone enlighten me, please? =)
16-digit numbers fit within the range of a 64-bit number, so you should use uint64_t (from <stdint.h>). The % operator should then do what you need.
If you need bigger numbers, then you'll need to use a big-integer library. However, if all you're interested in is modulus, then there's a trick you can pull, based on the following properties of modulus:
mod(a * b) == mod(mod(a) * mod(b))
mod(a + b) == mod(mod(a) + mod(b))
As an example, let's express a 16-digit decimal number, x as:
x = x_hi * 1e8 + x_lo; // this is pseudocode, not real C
where x_hi is the 8 most-significant decimal digits, and x_lo the least-significant. The modulus of x can then be expressed as:
mod(x) == mod(mod(x_hi) * mod(1e8) + mod(x_lo));
where mod(1e8) is a constant which you can precalculate.
All of this can be done in integer arithmetic.
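As a small sketch of that trick in pure integer arithmetic, assuming the modulus m is below 2^32 so the intermediate product cannot overflow a 64-bit unsigned integer (the function name is mine):

#include <cstdint>

// x = x_hi * 1e8 + x_lo, as in the pseudocode above; requires m < 2^32.
std::uint64_t mod_split(std::uint64_t x_hi, std::uint64_t x_lo, std::uint64_t m)
{
    const std::uint64_t e8_mod = 100000000ull % m;     // mod(1e8), precalculable
    return ((x_hi % m) * e8_mod + (x_lo % m)) % m;     // both factors are below 2^32
}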
I could actually use a comment that was deleted right after (wonder why), that said:
modulus = a - a/b * b;
I cast the division to unsigned long long.
Now... I was a bit disappointed, because in my problem I thought I could keep raising the number of characters of the word with no problem. Nevertheless, I started to get size issues at 7 characters. Why? 95^7 starts to give huge numbers.
I was hoping to get the possibility to write a word like "my cat is so fat I 1234r5s" and calculate the index of this, but this word has almost 30 characters:
95^26 = 2635200944657423647039506726457895338535308837890625 combinations.
Anyway, thanks for the answer.
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Best way to detect integer overflow in C/C++
If I have an expression x + y (in C or C++) where x and y are both of type uint64_t and the addition overflows, how do I detect how much it overflowed by (the carry), place that in another variable, then compute the remainder?
The remainder will already be stored in the sum x + y, assuming you are using unsigned integers. Unsigned integer overflow causes a wrap-around (signed integer overflow is undefined). See the standards reference from Pascal in the comments.
The overflow can only be 1 bit. If you add two 64-bit numbers, there cannot be more than 1 carry bit, so you just have to detect the overflow condition.
For how to detect overflow, there was a previous question on that topic: best way to detect integer overflow.
For z = x + y, z stores the remainder, and the overflow can only be 1 bit, which is easy to detect. If you were dealing with signed integers, then there's an overflow if x and y have the same sign but z has the opposite one; you cannot overflow if x and y have different signs. For unsigned integers, the simplest test is z < x (equivalently, z < y): the addition wrapped if and only if the result is smaller than either operand.
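A minimal sketch of that test for uint64_t (the helper name is mine):

#include <cstdint>

void add_with_carry(std::uint64_t x, std::uint64_t y,
                    std::uint64_t& remainder, std::uint64_t& carry)
{
    remainder = x + y;         // wraps modulo 2^64 (well-defined for unsigned types)
    carry = (remainder < x);   // 1 iff the sum is smaller than an operand, i.e. it wrapped
}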
The approach in C and C++ can be quite different, because in C++ you can have operator overloading work for you and wrap the integer you want to protect in some kind of class (for which you would overload the necessary operators). In C, you would have to wrap the integer you want to protect in a structure (to carry the remainder as well as the result) and call some function to do the heavy lifting.
Other than that, the approach in the two languages is the same: depending on the operation you want to perform (adding, in your example) you have to figure out the worst that could happen and handle it.
In the case of adding, it's quite simple: the sum exceeds some maximum value M exactly when the difference between M and one operand is smaller than the other operand, and in that case you can calculate the remainder, the part that's too big: if ((M - O1) < O2) R = O2 - (M - O1) (e.g. if M is 100, O1 is 80 and O2 is 30, then 30 - (100 - 80) = 10, which is the remainder).
The case of subtraction is equally simple: if your first operand is smaller than the second, the remainder is the second minus the first (if (O1 < O2) { Rem = O2 - O1; Result = 0; } else { Rem = 0; Result = O1 - O2; }).
It's multiplication that's a bit more difficult: your safest bet is to do a binary multiplication of the values and check that your resulting value doesn't exceed the number of bits you have. Binary multiplication is a long multiplication, just like you would do if you were doing a decimal multiplication by hand on paper, so, for example, 6 * 5 is:

    0110      (6)
  x 0101      (5)
  ======
    0110      bit 0 of 5 is 1: add 6, shifted left by 0
   0000       bit 1 of 5 is 0
  0110        bit 2 of 5 is 1: add 6, shifted left by 2
 0000         bit 3 of 5 is 0
 =======
  011110 = 30

If you had a four-bit integer, you'd have an overflow of one bit here (i.e. bit 4 is 1, bit 5 is 0, so only bit 4 counts as an overflow).
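A hedged sketch of such an overflow check for 64-bit unsigned multiplication, using the divide-back test instead of spelling out the long multiplication (the helper name is mine):

#include <cstdint>

// Returns true iff x * y does not fit in 64 bits; z receives the wrapped product.
bool mul_overflows(std::uint64_t x, std::uint64_t y, std::uint64_t& z)
{
    z = x * y;                       // wraps modulo 2^64 (well-defined for unsigned types)
    return x != 0 && z / x != y;     // divide back: any mismatch means bits were lost
}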
For division you only really need to care about division by 0, most of the time; the rest will be handled by your CPU.
HTH
rlc