Can this checksum algorithm be improved? - c++

We have a very old, unsupported program which copies files across SMB shares. It has a checksum algorithm to determine if the file contents have changed before copying. The algorithm seems easily fooled -- we've just found an example where two files, identical except a single '1' changing to a '2', return the same checksum. Here's the algorithm:
unsigned long GetFileCheckSum(CString PathFilename)
{
FILE* File;
unsigned long CheckSum = 0;
unsigned long Data = 0;
unsigned long Count = 0;
if ((File = fopen(PathFilename, "rb")) != NULL)
{
while (fread(&Data, 1, sizeof(unsigned long), File) != FALSE)
{
CheckSum ^= Data + ++Count;
Data = 0;
}
fclose(File);
}
return CheckSum;
}
I'm not much of a programmer (I am a sysadmin) but I know an XOR-based checksum is going to be pretty crude. What're the chances of this algorithm returning the same checksum for two files of the same size with different contents? (I'm not expecting an exact answer, "remote" or "quite likely" is fine.)
How could it be improved without a huge performance hit?
Lastly, what's going on with the fread()? I had a quick scan of the documentation but I couldn't figure it out. Is Data being set to each byte of the file in turn? Edit: okay, so it's reading the file into unsigned long (let's assume a 32-bit OS here) chunks. What does each chunk contain? If the contents of the file are abcd, what is the value of Data on the first pass? Is it (in Perl):
(ord('a') << 24) & (ord('b') << 16) & (ord('c') << 8) & ord('d')

MD5 is commonly used to verify the integrity of transfer files. Source code is readily available in c++. It is widely considered to be a fast and accurate algorithm.
See also Robust and fast checksum algorithm?

I'd suggest you take a look at Fletcher's checksum, specifically fletcher-32, which ought to be fairly fast, and detect various things the current XOR chain would not.

You could easily improve the algorithm by using a formula like this one:
Checksum = (Checksum * a + Data * b) + c;
If a, b and c are large primes, this should return good results. After this, rotating (not shifting!) the bits of checksum will further improve it a bit.
Using primes, this is a similar algorithm to that used for Linear congruential generators - it guarantees long periods and good distribution.

The fread bit is reading in the file one chunk at a time. Each chunk is the size of a long (in c this is not a well defined size but you can assume 32 or 64 bits). Depending on how it gets buffered, this might not be to bad. OTOH, reading a larger chunk into an array and looping over it might be a lot faster.

Even "expensive" cryptographic hash functions usually require multiple iterations to take significant amounts of time. Although no longer recommended for cryptographic purposes, where users would deliberately try to create collisions, functions like SHA1 and MD5 are widely available and suitable for this purpose.
If a smaller hash value is needed, CRC is alright, but not great. A n-bit CRC will fail to detect a small fraction of changes that are longer than n bits. For example, suppose just a single dollar amount in a file is changed, from $12,345 to $34,567. A 32-bit CRC might miss that change.
Truncating the result of a longer cryptographic hash will detect changes more reliably than a CRC.

I seems like your algorithm makes no effort to deal with files that are not an exact multiple of 4 bytes in size. The return value of fread is not a boolean but the number of bytes actually read, which will differ from 4 in the case of an EOF or if an error occurred. You are checked for neither, but simply assuming that if it didn't return 0, you have 4 valid bytes in 'data' which which to calculate your hash.
If you really want to use a hash, I'd recommend several things. First, use a simple cryptographic hash like MD5, not CRC32. CRC32 is decent for checking data validity, but for spanning a file system and ensuring no collisions, its not as great a tool because of the birthday paradox mentioned in the comments elsewhere. Second, don't write the function yourself. Find an existing implementation. Finally, consider simply using rsync to replicate files instead of rolling your own solution.

{
CheckSum ^= Data + ++Count;
Data = 0;
}
I don't think "++Count" do much work. The code is equivalent with
{
CheckSum ^= Data;
}
XORing a sequence of bytes is not enough. Especially with text files.
I suggest to use a hash function.

SHA-1 and (more recently SHA-2) provide excellent hashing functions and I believe as slowly supplanting MD5 due to better hashing properties. All of them (md2, sha, etc...) have efficient implementations and return a hash of a buffer that is several characters long (although always a fixed length). are provably more reliable than reducing a hash to an integer. If I had my druthers, I'd use SHA-2. Follow this link for libraries that implement SHA checksums.
If you don't want to compile in those libraries, linux (and probably cygwin) has the following executables: md5sum, sha1sum, sha224sum, sha256sum, sha384sum, sha512sum; to which you can provide your file and they will print out the checksum as a hex string.
You can use popen to execute those programs -- with something like this:
const int maxBuf=1024;
char buf[maxBuf];
FILE* f = popen( "sha224sum myfile", "w" );
int bytesRead = f.read( buf, maxBuf );
fclose( f );
Obviously this will run quite a lot slower, but makes for a useful first pass.
If speed is an issue, given that file hashing operations like this and I/O bound (memory and disk access will be you bottlenecks), I'd expect all of this algorithms to run about as fast a one that produces an unsigned int. Perl and Python also come with implementations of MD5 SHA1 and SHA2 and will probably run as fast as in C/C++.

Related

Detect endianness of binary file data

Recently I was (again) reading about 'endian'ness. I know how to identify the endianness of host, as there are lots of post on SO, and also I have seen this, which I think is pretty good resource.
However, one thing I like to know is to how to detect the endianness of input binary file. For example, I am reading a binary file (using C++) like following:
ifstream mydata("mydata.raw", ios::binary);
short value;
char buf[sizeof(short)];
int dataCount = 0;
short myDataMat[DATA_DIMENSION][DATA_DIMENSION];
while (mydata.read(reinterpret_cast<char*>(&buf), sizeof(buf)))
{
memcpy(&value, buf, sizeof(value));
myDataMat[dataCount / DATA_DIMENSION][dataCount%DATA_DIMENSION] = value;
dataCount++;
}
I like to know how I can detect the endianness in the mydata.raw, and whether endianness affects this program anyway.
Additional Information:
I am only manipulating the data in myDataMat using mathematical operations, and no pointer operation or bitwise operation is done on the data).
My machine (host) is little endian.
It is impossible to "detect" the endianity of data in general. Just like it is impossible to detect whether the data is an array of 4 byte integers, or twice that many 2 byte integers. Without any knowledge about the representation, raw data is just a mass of meaningless bits.
However, with some extra knowledge about the data representation, it become possible. Some examples:
Most file formats mandate particular endianity, in which case this is never a problem.
Unicode text files may optionally start with a byte order mark. Same idea can be implemented by other data representations.
Some file formats contain a checksum. You can guess one endianity, and if the checksum does not match, try again with another endianity. It will be unlikely that the checksum matches with wrong interpretation of the data.
Sometimes you can make guesses based on the data. Is the temperature outside 33'554'432 degrees, or maybe 2? You can pick the endianity that represents sane data. Of course, this type of guesswork fails miserably, when the aliens invade and start melting our planet.
You can't tell.
The endianness transformation is essentially an operator E(x) on a number x such that x = E(E(x)). So you don't know "which way round" the x elements are in your file.

What's the best way to hash a string vector not very long (urls)?

I am now dealing with url classification. I partition url with "/?", etc, generating a bunch of parts. In the process, I need to hash the first part to the kth part, say, k=2, then for "http://stackoverflow.com/questions/ask", the key is a string vector "stackoverflow.com questions". Currently, the hash is like Hash. But it consumes a lot of memory. I wonder whether MD5 can help or are there other alternatives. In effect, I do not need to recover the key exactly, as long as differentiating different keys.
Thanks!
It consumes a lot of memory
If your code already works, you may want to consider leaving it as-is. If you don't have a target, you won't know when you're done. Are you sure "a lot" is synonymous with "too much" in your case?
If you decide you really need to change your working code, you should consider the large variety of the options you have available, rather than taking someone's word for a specific algorithm:
http://en.wikipedia.org/wiki/List_of_hash_functions
http://en.wikipedia.org/wiki/Comparison_of_cryptographic_hash_functions
http://www.strchr.com/hash_functions
etc
Not sure about memory implications, and it certainly would change your perf profile, but you could also look into using Tries:
http://en.wikipedia.org/wiki/Trie
MD5 is a nice hash code for stuff where security is not an issue. It's fast and reasonably long (128 bits is enough for most applications). Also the distribution is very good.
Adler32 would be a possible alternative. It's very easy to implement, just a few lines of code. It's even faster then MD5. And it's long enough/good enough for many applications (though for many it is not).
(I know Adler32 is strictly not a hash-code, but it will still do fine for many applications)
However, if storing the hash-code is consuming a lot of memory, you can always truncate the hash-code, or use XOR to "shrink" it. E.g.
uint8_t md5[16];
GetMD5(md5, ...);
// use XOR to shrink the MD5 to 32 bits
for (size_t i = 4; i < 16; i++)
md5[i % 4] ^= md5[i];
// assemble the parts into one uint32_t
uint32_t const hash = md5[0] + (md5[1] << 8) + (md5[2] << 16) + (md5[3] << 24);
Personally I think MD5 would be overkill though. Have a look at Adler32, I think it will do.
EDIT
I have to correct myself: Adler23 is a rather poor choice for short strings (less then a few thousand bytes). I had completely forgotten about that. But there is always the obvious: CRC32. Not as fast as Adler23 (about the same speed as MD5), but still acceptably easy to implement, and there are also a ton of existing implementations with all kinds of licenses out there.
If you're only trying to find out if two URL's are the same have you considered storing a binary version of the IP address of the server? If two server names resolve to the same address is that incorrect or an advantage for your application?

fastest way to write a bitstream on modern x86 hardware

What is the fastest way to write a bitstream on x86/x86-64? (codeword <= 32bit)
by writing a bitstream I refer to the process of concatenating variable bit-length symbols into a contiguous memory buffer.
currently I've got a standard container with a 32bit intermediate buffer to write to
void write_bits(SomeContainer<unsigned int>& dst,unsigned int& buffer, unsigned int& bits_left_in_buffer,int codeword, short bits_to_write){
if(bits_to_write < bits_left_in_buffer){
buffer|= codeword << (32-bits_left_in_buffer);
bits_left_in_buffer -= bits_to_write;
}else{
unsigned int full_bits = bits_to_write - bits_left_in_buffer;
unsigned int towrite = buffer|(codeword<<(32-bits_left_in_buffer));
buffer= full_bits ? (codeword >> bits_left_in_buffer) : 0;
dst.push_back(towrite);
bits_left_in_buffer = 32-full_bits;
}
}
Does anyone know of any nice optimizations, fast instructions or other info that may be of use?
Cheers,
I wrote once a quite fast implementation, but it has several limitations: It works on 32 bit x86 when you write and read the bitstream. I don't check for buffer limits here, I was allocating larger buffer and checked it from time to time from the calling code.
unsigned char* membuff;
unsigned bit_pos; // current BIT position in the buffer, so it's max size is 512Mb
// input bit buffer: we'll decode the byte address so that it's even, and the DWORD from that address will surely have at least 17 free bits
inline unsigned int get_bits(unsigned int bit_cnt){ // bit_cnt MUST be in range 0..17
unsigned int byte_offset = bit_pos >> 3;
byte_offset &= ~1; // rounding down by 2.
unsigned int bits = *(unsigned int*)(membuff + byte_offset);
bits >>= bit_pos & 0xF;
bit_pos += bit_cnt;
return bits & BIT_MASKS[bit_cnt];
};
// output buffer, the whole destination should be memset'ed to 0
inline unsigned int put_bits(unsigned int val, unsigned int bit_cnt){
unsigned int byte_offset = bit_pos >> 3;
byte_offset &= ~1;
*(unsigned int*)(membuff + byte_offset) |= val << (bit_pos & 0xf);
bit_pos += bit_cnt;
};
It's hard to answer in general because it depends on many factors such as the distribution of bit-sizes you are reading, the call pattern in the client code and the hardware and compiler. In general, the two possible approaches for reading (writing) from a bitstream are:
Using a 32-bit or 64-bit buffer and conditionally reading (writing) from the underlying array it when you need more bits. That's the approach your write_bits method takes.
Unconditionally reading (writing) from the underlying array on every bitstream read (write) and then shifting and masking the resultant values.
The primary advantages of (1) include:
Only reads from the underlying buffer the minimally required number of times in an aligned fashion.
The fast path (no array read) is somewhat faster since it doesn't have to do the read and associated addressing math.
The method is likely to inline better since it doesn't have reads - if you have several consecutive read_bits calls, for example, the compiler can potentially combine a lot of the logic and produce some really fast code.
The primary advantage of (2) is that it is completely predictable - it contains no unpredictable branches.
Just because there is only one advantage for (2) doesn't mean it's worse: that advantage can easily overwhelm everything else.
In particular, you can analyze the likely branching behavior of your algorithm based on two factors:
How often will the bitsteam need to read from the underlying buffer?
How predictable is the number of calls before a read is needed?
For example if you are reading 1 bit 50% of the time and 2 bits 50% of time, you will do 64 / 1.5 = ~42 reads (if you can use a 64-bit buffer) before requiring an underlying read. This favors method (1) since reads of the underlying are infrequent, even if mis-predicted. On the other hand, if you are usually reading 20+ bits, you will read from the underlying every few calls. This is likely to favor approach (2), unless the pattern of underlying reads is very predictable. For example, if you always read between 22 and 30 bits, you'll perhaps always take exactly three calls to exhaust the buffer and read the underlying1 array. So the branch will be well-predicated and (1) will stay fast.
Similarly, it depends on how you call these methods, and how the compiler can inline and simplify the code. Especially if you ever call the methods repeatedly with a compile-time constant size, a lot of simplification is possible. Little to no simplification is available when the codeword is known at compile-time.
Finally, you may be able to get increased performance by offering a more complex API. This mostly applies to implementation option (1). For example, you can offer an ensure_available(unsigned size) call which ensures that at least size bits (usually limited the buffer size) are available to read. Then you can read up to that number of bits using unchecked calls that don't check the buffer size. This can help you reduce mis-predictions by forcing the buffer fills to a predictable schedule and lets you write simpler unchecked methods.
1 This depends on exactly how your "read from underlying" routine is written, as there are a few options here: Some always fill to 64-bits, some fill to between 57 and 64-bits (i.e., read an integral number of bytes), and some may fill between 32 or 33 and 64-bits (like your example which reads 32-bit chunks).
You'll probably have to wait until 2013 to get hold of real HW, but the "Haswell" new instructions will bring proper vectorised shifts (ie the ability to shift each vector element by different amounts specified in another vector) to x86/AVX. Not sure of details (plenty of time to figure them out), but that will surely enable a massive performance improvement in bitstream construction code.
I don't have the time to write it for you (not too sure your sample is actually complete enough to do so) but if you must, I can think of
using translation tables for the various input/output bit shift offsets; This optimization would make sense for fixed units of n bits (with n sufficiently large (8 bits?) to expect performance gains)
In essence, you'd be able to do
destloc &= (lookuptable[bits_left_in_buffer][input_offset][codeword]);
disclaimer: this is very sloppy pseudo code, I just hope it conveys my idea of a lookup table o prevent bitshift arithmetics
writing it in assembly (I know i386 has XLAT, but then again, a good compiler might already use something like that)
; Also, XLAT seems limited to 8 bits and the AL register, so it's not really versatile
Update
Warning: be sure to use a profiler and test your optimization for correctness and speed. Using a lookup table can result in poorer performance in the light of locality of reference. So, you might need to change the bit-streaming thread on a single core (set thread affinity) to get the benefits, and you might have to adapt the lookup table size to the processor's L2 cache.
Als, have a look at SIMD, SSE4 or GPU (CUDA) instruction sets if you know you'll have certain features at your disposal.

Any way to read big endian data with little endian program?

An external group provides me with a file written on a Big Endian machine, and they also provide a C++ parser for the file format.
I only can run the parser on a little endian machine - is there any way to read the file using their parser without add a swapbytes() call after each read?
Back in the early Iron Age, the Ancients encountered this issue when they tried to network primitive PDP-11 minicomputers with other primitive computers. The PDP-11 was the first little-Endian computer, while most others at the time were big-Endian.
To solve the problem, once and for all, they developed the network byte order concept (always big-Endia), and the corresponding network byte order macros ntohs(), ntohl(), htons(), and htonl(). Code written with those macros will always "get the right answer".
Lean on your external supplier to use the macros in their code, and the file they supply you will always be big-Endian, even if they switch to a little-Endian machine. Rewrite the parser they gave you to use the macros, and you will always be able to read their file, even if you switch to a big-Endian machine.
A truly prodigious amount of programmer time has been wasted on this particular problem. There are days when I think a good argument could be made for hanging the PDP-11 designer who made the little-Endian feature decision.
Try persuading the parser team to include the following code:
int getInt(char* bytes, int num)
{
int ret;
assert(num == 4);
ret = bytes[0] << 24;
ret |= bytes[1] << 16;
ret |= bytes[2] << 8;
ret |= bytes[3];
return ret;
}
it might be more time consuming than a general int i = *(reinterpret_cast<*int>(&myCharArray)); but will always get the endianness right on both big and small endian systems.
In general, there's no "easy" solution to this. You will have to modify the parser to swap the bytes of each and every integer read from the file.
It depends upon what you are doing with the data. If you are going to print the data out, you need to swap the bytes on all the numbers. If you are looking through the file for one or more values, it may be faster to byte swap your comparison value.
In general, Greg is correct, you'll have to do it the hard way.
the best approach is to just define the endianess in the file format, and not say it's machine dependent.
the writer will have to write the bytes in the correct order regardless of the CPU it's running on, and the reader will have to do the same.
You could write a parser that wraps their parser and reverses the bytes, if you don't want to modify their parser.
Be conscious of the types of data being read in. A 4-byte int or float would need endian correction. A 4-byte ASCII string would not.
In general, no.
If the read/write calls are not type aware (which, for example fread and fwrite are not) then they can't tell the difference between writing endian sensitive data and endian insensitive data.
Depending on how the parser is structured you may be able to avoid some suffering, if the I/O functions they use are aware of the types being read/written then you could modify those routines apply the correct endian conversions.
If you do have to modify all the read/write calls then creating just such a routine would be a sensible course of action.
Your question somehow conatins the answer: No!
I only can run the parser on a little endian machine - is there any way to read the file using their parser without add a swapbytes() call after each read?
If you read (and want to interpret) big endian data on a little endian machine, you must somehow and somewhere convert the data. You might do this after each read or after the whole file has been read (if the data read does not contain any information on how to read further data) - but there is no way in omitting the conversion.

Hash of a string to be of specific length

Is there a way to generate a hash of a string so that the hash itself would be of specific length? I've got a function that generates 41-byte hashes (SHA-1), but I need it to be 33-bytes max (because of certain hardware limitations). If I truncate the 41-byte hash to 33, I'd probably (certainly!) lost the uniqueness.
Or actually I suppose an MD5 algorithm would fit nicely, if I could find some C code for one with your help.
EDIT: Thank you all for the quick and knowledgeable responses. I've chosen to go with an MD5 hash and it fits fine for my purpose. The uniqueness is an important issue, but I don't expect the number of those hashes to be very large at any given time - these hashes represent software servers on a home LAN, so at max there would be 5, maybe 10 running.
If I truncate the 41-byte hash to 33, I'd probably (certainly!) lost the uniqueness.
What makes you think you've got uniqueness now? Yes, there's clearly a higher chance of collision when you're only playing with 33 bytes instead of 41, but you need to be fully aware that collisions are only ever unlikely, not impossible, for any situation where it makes sense to use a hash in the first place. If you're hashing more than 41 bytes of data, there are clearly more possible combinations than there are hashes available.
Now, whether you'd be better off truncating the SHA-1 hash or using a shorter hash such as MD5, I don't know. I think I'd be more generally confident when keeping the whole of a hash, but MD5 has known vulnerabilities which may or may not be a problem for your particular application.
The way hashes are calculated that's unfortunately not possible. To limit the hash length to 33 bytes, you will have to cut it. You could xor the first and last 33 bytes, as that might keep more of the information. But even with 33 bytes you don't have that big a chance of a collision.
md5: http://www.md5hashing.com/c++/
btw. md5 is 16 bytes, sha1 20 bytes and sha256 is 32 bytes, however as hexstrings, they all double in size. If you can store bytes, you can even use sha256.
There is no more chance of collision with substring(sha_hash, 0, 33) than with any other hash that is 33 bytes long, due to the way hash algorithms are designed (entropy is evenly spread out in the resulting string).
You could use an Elf hash(<- C code included) or some other simple hash function like that instead of MD5 or SHA-X.
They are not secure, but they can be tuned to any length you need
/*****Please include following header files*****/
// string
/***********************************************/
/*****Please use following namespaces*****/
// std
/*****************************************/
static unsigned int ELFHash(string str) {
unsigned int hash = 0;
unsigned int x = 0;
unsigned int i = 0;
unsigned int len = str.length();
for (i = 0; i < len; i++)
{
hash = (hash << 4) + (str[i]);
if ((x = hash & 0xF0000000) != 0)
{
hash ^= (x >> 24);
}
hash &= ~x;
}
return hash;
}
Example
string data = "jdfgsdhfsdfsd 6445dsfsd7fg/*/+bfjsdgf%$^";
unsigned int value = ELFHash(data);
Output
248446350
Hashes are by definition only unique for small amount of data (and even then it's still not guaranteed). It is impossible to map a large amount of information uniquely to a small amount of information by virtue of the fact that you can't mmagically get rid of information and get it back later. Keep in mind this isn't compression going on.
Personally, I'd use MD5 (if you need to store in text), or a 256b (32B) hash such as SHA256 (if you can store in binary) in this situation. Truncating another hash algorithm to 33B works too, and MAY increase the possibility of generating hash collisions. It depends alot on the algorithm.
Also, yet another C implementation of MD5, by the people who designed it.
I believe the MD5 hashing algorithm results in a 32 digit number so maybe that one will be more suitable.
Edit: to access MD5 functionality, it should be possible to hook into the openssl libraries. However you mentioned hardware limitations so this may no be possible in your case.
Here is an MD5 implementation in C.
The chance of a 33-byte collision is 1/2^132 (by the Birthday Paradox)
So don't worry about losing uniqueness.
Update: I didn't check the actual byte length of SHA1. Here's the relevant calculation: a 32-nibble collision (33 bytes of hex - 1 termination char), occurs only when the number of strings hashed becomes around sqrt(2^(32*4)) = 2^64.
Use Apache's DigestUtils:
http://commons.apache.org/codec/api-release/org/apache/commons/codec/digest/DigestUtils.html#md5Hex(java.lang.String)
Converts the hash into 32 character hex string.