MD5 Code Coverage - C++

I'm currently implementing an MD5 hash algorithm based on the RSA Data Security reference code. In the UpdateData method there is a section which reads:
mCount[0] += (length << 3);
if (mCount[0] < (length << 3))
{
mCount[1]++;
}
I'm trying to understand how the if statement could ever evaluate to true, given that the mCount[0] value is initialised to 0. Any help would be greatly appreciated.
Thanks

It can happen if there is an overflow of the mCount[0] variable.
unsigned int i = 4294967295; // 2^32 - 1, i.e. UINT_MAX
unsigned int j = 1;
i += j;          // wraps around to 0
assert(i < j);   // holds: 0 < 1
The block of code you mentioned is probably called multiple times, depending on how much data there is to process. So mCount[0] will eventually overflow.

This is carry propagation: the running bit count, length * 8, is stored across two 32-bit words, mCount[1]:mCount[0] (mCount is likely an array of unsigned int). The pattern
lo += a;
if (lo < a) hi++; // true exactly when lo + a wrapped past 2^32
is equivalent to the 64-bit operation
(hi:lo) += (0:a)
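A minimal demonstration of that equivalence (my own sketch, not from the MD5 source; the starting values are arbitrary):
#include <cassert>
#include <cstdint>

int main() {
    uint32_t lo = 0xFFFFFFF0u, hi = 7;  // arbitrary starting count
    uint32_t a = 0x20u;                 // amount being added (length << 3)

    uint64_t wide = ((uint64_t)hi << 32) | lo;  // reference 64-bit counter
    wide += a;

    lo += a;              // wraps modulo 2^32
    if (lo < a) hi++;     // carry into the high word exactly on wrap

    assert(lo == (uint32_t)wide);
    assert(hi == (uint32_t)(wide >> 32));
    return 0;
}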

This would be the case only if mCount[0] were a signed type: a negative value before the addition (with no wrap-around in the addition itself) also makes the comparison true. In the MD5 code, though, mCount is unsigned, so wrap-around is the only way the test fires.

I know this is not technically an answer to your programming question, but I honestly think the most valuable advice I can give you is that you should use a widely-used and well-vetted MD5 (or any other crypto) algorithm, DO NOT roll your own. How can I put this delicately... this advice is doubly true if you are asking questions about integer math. The road to hell is littered with the bodies of people who tried to implement tricky crypto themselves without fully understanding what they were doing, and ended up leaving gaping security holes in the process. Be smart, use somebody else's debugged implementation, use your own valuable time to implement parts of the system you can't get from someplace else.

As Eric and Brian point out, this happens when mCount[0] overflows, which occurs every 2^29 bytes (512 MB) of input, so if you hash files or data streams larger than that you will see this code trigger.
Using two 32-bit counters allows for 2^61 bytes (2^64 bits) of input before the length counter truly overflows.

Related

VOLUME_BITMAP_BUFFER - Accessing the BYTE Buffer?

As the title suggests, I'm doing some low-level volume access (as close to C code as possible) to make a disk-clone program. I want to set all the free clusters to zero so that a simple compression scheme keeps the image size small.
I've been beating my head against a wall forever trying to figure out why I can't get the FSCTL_GET_VOLUME_BITMAP control code working properly... so if possible please don't link me to external reading, as I've probably already been there and it's either been C#, dead links, or missing the explanation I'm looking for.
I want to understand the buffer itself more than I need the actual code.
The simplest way I can put it: what is the proper way in C/C++ to read from an array declared with a length of [1], like the Buffer member of VOLUME_BITMAP_BUFFER?
The only way I can even assign anything to it is by recreating the structure with my own huge buffer, and I still end up with errors, even after locking the volume in recovery mode. As a side note, I do acquire all the permissions needed to access the raw disk.
I know I'm likely missing some fundamental basic of C++ that would let me read from the memory it's stored in, but I just can't figure it out without getting errors.
In case I happen to be looping through the bytes wrong, which could be causing my error, I've added how I was doing it... although that still leaves the Buffer question.
I know the call can be issued multiple times for more data, but I have to assume it isn't handing back just 8 bytes at a time.
Something like this (pardon my code, I typed it on my phone so it likely has errors)... I tried to include any relevant cause of failure just in case, but the buffer is the real question.
#define BYTE_MASK 0x80
#define BITS_PER_BYTE 8
void foo() {
// note: 256 MB on the stack will overflow it; real code should heap-allocate
const int BUFFER_SIZE = 268435456;
struct {
LARGE_INTEGER StartingLcn;
LARGE_INTEGER BitmapSize;
BYTE Buffer[BUFFER_SIZE];
} volBuff;
// I want to use VOLUME_BITMAP_BUFFER
/* Part of a larger loop checking for errors and more data
BOOL b = DeviceIoControl(vol, FSCTL_GET_VOLUME_BITMAP, &lcnStart, sizeof(STARTING_LCN_INPUT_BUFFER), &volBuff, sizeof(volBuff), &dwRet, NULL);
*/
BYTE Mask = 1;
LONGLONG bmpLen = volBuff.BitmapSize.QuadPart; // bitmap length in bits
LONGLONG NotFree = 0, FreeSpc = 0;
for (LONGLONG x = 0; x < bmpLen / BITS_PER_BYTE;) {
if ((volBuff.Buffer[x] & Mask) != 0) {
NotFree++;
} else {
FreeSpc++;
}
// I did try not dividing the size
// advance to the next byte once the high bit has been tested
if (Mask == BYTE_MASK) {
Mask = 1;
x++;
} else {
Mask = (Mask << 1);
}
}
return;
}
I've literally put an entire project on hold for I don't even know how long, just being stubborn at this point... and I can't find any answer that actually explains the buffer, never mind the rest of the process.
If someone wants to be more thorough I won't complain after all my attempts, but the buffer is what's driving me crazy.
Any help would be greatly appreciated.
I can safely assume the one answer I was given:
""...array with a length of [1]..." there is no way in Standard C++ of accessing the additional bytes. You can either: (a) pray that your compiler can do this (via an extension) or (b) write a C module (where this is well defined) that you can call from C++. - Richard Critton"
was as correct an answer as I can expect after my extensive attempts to make this work any other way, especially since I was only able to make my own array work using standard C and not C++ directly.
I wanted to put a close on this since my computer is out of use for a bit.
If the problem continues after I dig through some examples for defragmenting in C that I FINALLY came across I'll ask a more direct question with full code to support it.
That answer was enough to remove the wall I had hit and get me thinking again. I thank you for that.
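For later readers: the conventional Win32 idiom for these trailing Buffer[1] structures is to allocate raw storage with extra room past the end of the declared struct and cast, which is exactly the compiler-extension territory the quoted answer describes. A sketch under those assumptions (error handling trimmed; bitmapBytes is a size you pick, not anything from the question):
#include <windows.h>
#include <winioctl.h>
#include <vector>

// Sketch: allocate extra space past the declared Buffer[1] so the driver
// can fill in the real bitmap. Returns the raw bytes; on hard failure the
// vector is empty. ERROR_MORE_DATA just means "call again for the rest".
std::vector<BYTE> QueryVolumeBitmap(HANDLE vol, LONGLONG startLcn, DWORD bitmapBytes)
{
    STARTING_LCN_INPUT_BUFFER in = {};
    in.StartingLcn.QuadPart = startLcn;

    std::vector<BYTE> raw(sizeof(VOLUME_BITMAP_BUFFER) + bitmapBytes);
    DWORD returned = 0;
    if (!DeviceIoControl(vol, FSCTL_GET_VOLUME_BITMAP,
                         &in, sizeof(in),
                         raw.data(), (DWORD)raw.size(),
                         &returned, NULL)
        && GetLastError() != ERROR_MORE_DATA)
    {
        raw.clear();  // real failure
    }
    return raw;
}

// Usage: the cast relies on a compiler extension, as the quoted answer says.
// VOLUME_BITMAP_BUFFER* bmp = reinterpret_cast<VOLUME_BITMAP_BUFFER*>(raw.data());
// bmp->BitmapSize.QuadPart is the bitmap length in bits; bmp->Buffer holds them.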

Optimizing subroutine with memcmp, memcpy

I wonder if there are any optimizations that would make this subroutine more efficient than memcmp/memcpy, maybe just a plain for loop, or breaking it down to fast assembly instructions. NUM_BYTES is a constant value (always = 18):
void ledSmoothWrite(uint8_t ledTarget[])
{
// If the new target is different, set new target
if(memcmp(target_arr, ledTarget, NUM_BYTES)) memcpy(target_arr, ledTarget, NUM_BYTES);
// Obtain equality
for(uint8_t i = 0; i < NUM_BYTES; i++)
{
if(rgb_arr[i] < target_arr[i]) rgb_arr[i]++;
else if(rgb_arr[i] > target_arr[i]) rgb_arr[i]--;
}
render();
}
This subroutine, which smoothly steps the LED colors toward their targets, might be called several hundred times per second. As the run time of the loop() function increases, it takes much longer for each LED to reach its desired value.
Any help would be greatly appreciated. Thank you in advance!
Check your documentation, but on many good compilers memcmp() and memcpy() are implemented as efficient machine-code instructions.
They may well be, for practical purposes, as fast as it gets.
Also, try not doing the comparison at all: depending on the probability that the two ranges are equal, doing the comparison and then (only if different) the copy may not be a net win.
However, the best solution is to not perform the copy at all!
If possible, just read out of ledTarget.
It's not exactly clear what you're doing, but animations often use 'double buffering' to avoid copying big states around.
So, if you're working concurrently, write into one buffer while reading from the other; then on the next cycle write into the other buffer and read from the first.
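A minimal sketch of that double-buffering idea (buffer names are mine, not from the question): keep two fixed arrays and swap pointers each cycle instead of copying the 18 bytes.
#include <stdint.h>

#define NUM_BYTES 18

static uint8_t bufA[NUM_BYTES];
static uint8_t bufB[NUM_BYTES];
static uint8_t* frontBuf = bufA;  // the buffer currently being rendered
static uint8_t* backBuf  = bufB;  // the buffer new target values go into

// Called once per cycle: exchanging two pointers replaces the memcpy.
static void swapBuffers(void)
{
    uint8_t* tmp = frontBuf;
    frontBuf = backBuf;
    backBuf  = tmp;
}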

What's the best way to hash a string vector not very long (urls)?

I am currently working on URL classification. I split each URL on separators such as "/" and "?", generating a list of parts. In the process, I need to hash the first part through the k-th part; say k = 2, then for "http://stackoverflow.com/questions/ask" the key is the string vector "stackoverflow.com questions". Currently the hash uses the raw string keys, and it consumes a lot of memory. I wonder whether MD5 can help, or whether there are other alternatives. In effect, I do not need to recover the key exactly, as long as different keys remain distinguishable.
Thanks!
"It consumes a lot of memory"
If your code already works, you may want to consider leaving it as-is. If you don't have a target, you won't know when you're done. Are you sure "a lot" is synonymous with "too much" in your case?
If you decide you really need to change your working code, you should consider the large variety of options available, rather than taking someone's word for one specific algorithm:
http://en.wikipedia.org/wiki/List_of_hash_functions
http://en.wikipedia.org/wiki/Comparison_of_cryptographic_hash_functions
http://www.strchr.com/hash_functions
etc
Not sure about memory implications, and it certainly would change your perf profile, but you could also look into using Tries:
http://en.wikipedia.org/wiki/Trie
MD5 is a nice hash code for stuff where security is not an issue. It's fast and reasonably long (128 bits is enough for most applications). Also the distribution is very good.
Adler32 would be a possible alternative. It's very easy to implement, just a few lines of code, and it's even faster than MD5. It's also long enough and good enough for many applications (though for many others it is not).
(I know Adler32 is strictly a checksum rather than a hash code, but it will still do fine for many applications.)
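To show what "a few lines of code" means, here is a sketch following the published Adler-32 definition (two running sums modulo 65521; the canonical implementation lives in zlib):
#include <cstddef>
#include <cstdint>

uint32_t adler32(const uint8_t* data, size_t len)
{
    const uint32_t MOD_ADLER = 65521;  // largest prime below 2^16
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; ++i) {
        a = (a + data[i]) % MOD_ADLER;  // sum of bytes
        b = (b + a) % MOD_ADLER;        // sum of the running sums
    }
    return (b << 16) | a;
}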
However, if storing the hash code is what consumes a lot of memory, you can always truncate it, or use XOR to "shrink" it. E.g.
uint8_t md5[16];
GetMD5(md5, ...);
// use XOR to fold the 128-bit MD5 down to 32 bits
for (size_t i = 4; i < 16; i++)
    md5[i % 4] ^= md5[i];
// assemble the four bytes into one uint32_t (cast before shifting so the
// top byte can't overflow a signed int)
uint32_t const hash = (uint32_t)md5[0] | ((uint32_t)md5[1] << 8)
                    | ((uint32_t)md5[2] << 16) | ((uint32_t)md5[3] << 24);
Personally I think MD5 would be overkill though. Have a look at Adler32, I think it will do.
EDIT
I have to correct myself: Adler32 is a rather poor choice for short strings (fewer than a few thousand bytes); I had completely forgotten about that. But there is always the obvious choice: CRC32. It's not as fast as Adler32 (about the same speed as MD5), but it's still acceptably easy to implement, and there are a ton of existing implementations under all kinds of licenses out there.
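If you'd rather not pull in a dependency, a bit-at-a-time CRC-32 is also short; this is a sketch of the common reflected form (IEEE polynomial 0xEDB88320), and table-driven versions are considerably faster:
#include <cstddef>
#include <cstdint>

uint32_t crc32(const uint8_t* data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            // 0u - (crc & 1u) is all-ones when the low bit is set, else zero
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}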
If you're only trying to find out whether two URLs are the same, have you considered storing a binary version of the server's IP address? If two server names resolve to the same address, is that incorrect, or an advantage for your application?

C++, using one byte to store two variables

I am working on a representation of the chess board, and I am planning to store it in a 32-byte array, where each byte stores two pieces (that way only 4 bits are needed per piece).
Doing it that way results in overhead when accessing a particular index of the board.
Do you think this code can be optimised, or that a completely different method of accessing indexes should be used?
#include <cstdlib>
#include <iostream>
using namespace std;

char getPosition(unsigned char* c, int index){
//move the pointer to the byte holding this index
c+=(index>>1);
//odd number
if (index & 1){
//taking right (low) nibble
return *c & 0xF;
}else
{
//taking left (high) nibble
return *c>>4;
}
}
void setValue(unsigned char* board, char value, int index){
//moving pointer
board+=(index>>1);
//odd number
if (index & 1){
//replace right (low) nibble, keep the left one
*board = (*board & 0xF0) | (value & 0xF);
}else
{
//replace left (high) nibble, keep the right one
*board = (*board & 0xF) | ((value & 0xF)<<4);
}
}
int main() {
unsigned char* c = (unsigned char*)malloc(32);
for (int i = 0; i < 64 ; i++){
setValue(c, i % 8, i);
}
for (int i = 0; i < 64 ; i++){
cout<<(int)getPosition(c, i)<<" ";
if ((i+1) % 8 == 0){
cout<<endl;
}
}
free(c);
return 0;
}
I am equally interested in your opinions on chess representations, and on optimisation of the method above as a standalone problem.
Thanks a lot
EDIT
Thanks for your replies. A while ago I created a checkers game, where I used a 64-byte board representation. This time I am trying some different methods, just to see what I like. Memory is not such a big problem. Bitboards are definitely on my list to try. Thanks
That's the problem with premature optimization. Where your chess board would have taken 64 bytes to store, now it takes 32. What has this really bought you? Did you actually analyze the situation to see whether you needed to save that memory?
Assume you use one of the least optimal search methods, a straight alpha-beta search to depth D with no heuristics, and that you generate all possible moves in a position before searching. The absolute maximum memory required for your boards is then sizeof(board) * W * D. If we assume a rather large branching factor W = 100 and a large depth D = 30, you'll have 3000 boards in memory at depth D: roughly 192 KB versus 96 KB. Is it really worth it?
On the other hand, you've increased the number of operations needed to access board[location], and this will be called many millions of times per search.
When building chess AIs, the main thing you'll end up looking for is CPU cycles, not memory. This may vary a little if you're targeting a cell phone or something, but even then you'll worry about speed long before you reach enough depth to cause memory issues.
As to which representation I prefer... I like bitboards. I haven't done a lot of serious measurements, but I did compare two engines I made, one bitboard and one array-based, and the bitboard one was faster and could reach much greater depths.
Let me be the first to point out a potential bug (depending on compilers and compiler settings), bugs being why premature optimization is evil:
//taking left part
return *c>>4;
If *c is negative, then >> may replicate the sign bit. In binary:
0b10100000 >> 4 == 0b11111010
on some compilers (the C++ standard leaves both choices to the implementation: whether right-shifting a negative value carries the high bit, and whether plain char is signed or unsigned).
If you do want to go forward with your packed bits (and let me say that you probably shouldn't bother, but it is up to you), I would suggest wrapping the packed bits in a class and overloading operator[] so that
board[x][y]
gives you the unpacked value; see the sketch below. Then you can turn the packing on and off easily while keeping the same syntax in either case. If you inline the operator overloads, it should be as efficient as the code you have now.
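A sketch of that wrapper, adapted to the flat 0-63 index the question's getPosition/setValue use rather than [x][y] (my own illustration, not the answerer's code): a small proxy returned by operator[] hides the nibble packing, so call sites don't change if you later switch to one byte per square.
#include <cstdint>

class PackedBoard {
    uint8_t data[32] = {};  // two squares per byte, as in the question

    // Proxy that reads or writes one 4-bit field in place.
    class Ref {
        uint8_t& byte;
        bool high;  // even indexes live in the high nibble, as in getPosition
    public:
        Ref(uint8_t& b, bool h) : byte(b), high(h) {}
        operator uint8_t() const {                       // unpack on read
            return high ? uint8_t(byte >> 4) : uint8_t(byte & 0x0FU);
        }
        Ref& operator=(uint8_t v) {                      // repack on write
            if (high) byte = uint8_t((byte & 0x0FU) | ((v & 0x0FU) << 4));
            else      byte = uint8_t((byte & 0xF0U) | (v & 0x0FU));
            return *this;
        }
    };

public:
    Ref operator[](int index) {
        return Ref(data[index >> 1], (index & 1) == 0);
    }
};

int main() {
    PackedBoard board;
    board[12] = 5;                    // writes one nibble
    return board[12] == 5 ? 0 : 1;    // reads it back
}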
Well, 64 bytes is a very small amount of RAM. You're better off just using a char[8][8], unless you plan on storing a ton of chess boards. A char[8][8] makes it easier (and faster) to access the board and to do more complex operations on it.
If you're still interested in the packed representation (either for practice or to store a lot of boards), I'd say you're "doing it right" regarding the bit operations. You may want to mark your accessors with the inline keyword if you're going for speed.
Is space enough of a consideration that you can't just use a full byte per square? That would make accesses easier to follow in the program, and most likely faster, since the bit manipulations would not be required.
Otherwise, to make sure everything goes smoothly, I would make all your types unsigned: have getPosition return unsigned char, and qualify your numeric literals with "U" (0xF0U, for example) so they're always interpreted as unsigned. Most likely you won't have any problems with signedness, but why take chances on some architecture that behaves unexpectedly?
Nice code, but if you are really that deep into performance optimization, you should probably learn more about your particular CPU architecture.
You may find that storing a chess piece in as much as 8 bytes will be more efficient. Even if you recurse 15 moves deep, L2 cache size would hardly be a constraint, but RAM misalignment may be. I would guess that proper handling of a chess board would include Expand() and Reduce() functions to translate between board representations during different parts of the algorithm: some parts may be faster on the compact representation, and some vice versa. For example, caching, and algorithms that hash the composition of two adjacent cells, might suit the compact structure; everything else, not.
I would also consider developing some helper hardware, like an FPGA board, or some GPU code, if performance is that important.
As a chess player, I can tell you: there's more to a position than the mere placement of each piece. You have to take into consideration some other things (a sketch of such a state record follows):
Which side has to move next?
Can white and/or black castle king and/or queenside?
Can a pawn be taken en passant?
How many moves have passed since the last pawn move and/or capturing move?
If the data structure you use to represent a position doesn't reflect this information, then you're in big trouble.
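For illustration only (field names are mine, not from any particular engine), the extra state that list describes might be recorded like this alongside the board:
#include <cstdint>

struct PositionState {
    bool    whiteToMove;        // which side moves next
    bool    castleWK, castleWQ; // white may still castle king/queenside
    bool    castleBK, castleBQ; // same for black
    int8_t  enPassantFile;      // file of a pawn capturable en passant, -1 if none
    uint8_t halfmoveClock;      // moves since last pawn move or capture (50-move rule)
};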

What is faster (x < 0) or (x == -1)?

Variable x is int with possible values: -1, 0, 1, 2, 3.
Which expression will be faster (in CPU ticks):
1. (x < 0)
2. (x == -1)
Language: C/C++, but I suppose all other languages will behave the same.
P.S. I personally think that answer is (x < 0).
More widely for gurus: what if x from -1 to 2^30?
That depends entirely on the ISA you're compiling for, and the quality of your compiler's optimizer. Don't optimize prematurely: profile first to find your bottlenecks.
That said, on x86 you'll find that both are equally fast in most cases. In both cases, you'll have a comparison (cmp) and a conditional jump (jCC) instruction. However, for (x < 0), there may be some instances where the compiler can elide the cmp instruction, speeding up your code by one whole cycle.
Specifically, if the value x is stored in a register and was recently the result of an arithmetic operation (such as add, or sub, but there are many more possibilities) that sets the sign flag SF in the EFLAGS register, then there's no need for the cmp instruction, and the compiler can emit just a js instruction. There's no simple jCC instruction that jumps when the input was -1.
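A small illustration of that case (my own example; the actual codegen depends on the compiler and flags):
// With optimization on, a compiler may use the sign flag the subtraction
// already set, e.g. "sub eax, esi; js ..." with no separate cmp.
int sign_of_difference(int a, int b)
{
    int x = a - b;   // the arithmetic result sets SF on x86
    if (x < 0)       // can compile to a bare js, no cmp needed
        return -1;
    return 0;
}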
Try it and see! Do a million, or better, a billion of each and time them. I bet there is no statistical significance in your results, but who knows, maybe on your platform and compiler you might find one.
This is a great experiment to convince yourself that premature optimization is probably not worth your time, and may well be "the root of all evil", at least in programming.
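If you do try it, something like this minimal harness works (my own sketch; the volatile sink keeps the optimizer from deleting the loop bodies outright, though the numbers will still be noisy):
#include <chrono>
#include <cstdio>

volatile int sink;  // prevents the comparisons from being optimized away

int main() {
    const long long N = 1000000000LL;  // one billion iterations
    volatile int x = -1;

    auto t0 = std::chrono::steady_clock::now();
    for (long long i = 0; i < N; ++i) sink = (x < 0);
    auto t1 = std::chrono::steady_clock::now();
    for (long long i = 0; i < N; ++i) sink = (x == -1);
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("x < 0  : %lld ms\n", (long long)ms(t1 - t0));
    std::printf("x == -1: %lld ms\n", (long long)ms(t2 - t1));
    return 0;
}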
Both operations can be done in a single CPU step, so they should be the same performance wise.
x < 0 will be faster. If nothing else, it prevents fetching the constant -1 as an operand.
Most architectures have special instructions for comparing against zero, so that will help too.
It could be dependent on what operations precede or succeed the comparison. For example, if you assign a value to x just before doing the comparison, then it might be faster to check the sign flag than to compare to a specific value. Or the CPU's branch-prediction performance could be affected by which comparison you choose.
But, as others have said, this is dependent upon CPU architecture, memory architecture, compiler, and a lot of other things, so there is no general answer.
The important consideration, anyway, is which test actually directs your program flow accurately, and which just happens to produce the same result?
If x is actually an index or a value in an enum, then will -1 always be what you want, or will any negative value work? Right now -1 is the only negative value, but that could change.
You can't even answer this question out of context. If you try for a trivial microbenchmark, it's entirely possible that the optimizer will waft your code into the ether:
// Get time
int x = -1;
for (int i = 0; i < ONE_JILLION; i++) {
int dummy = (x < 0); // Poof! Dummy is ignored.
}
// Compute time difference - in the presence of good optimization
// expect this time difference to be close to useless.
Same, both operations are usually done in 1 clock.
It depends on the architecture, but the x == -1 is more error-prone. x < 0 is the way to go.
As others have said, there probably isn't any difference. Comparisons are such fundamental operations in a CPU that chip designers want to make them as fast as possible.
But there is something else you could consider: analyze the frequencies of each value and order your comparisons accordingly, testing the most frequent value first. This could save you quite a few cycles. Of course, you still need to compile your code to asm to verify this.
I'm sure you're confident this is a real time-taker.
I would suppose asking the machine would give a more reliable answer than any of us could.
I've found, even in code like you're talking about, that my assumptions about where the time was going were not quite correct. For example, if this is in an inner loop and there is any sort of function call, even an invisible one inserted by the compiler, the cost of that call will dominate by far.
Nikolay, you write:
"It's actually bottleneck operator in the high-load program. Performance in this 1-2 strings is much more valuable than readability... All bottlenecks are usually this small, even in perfect design with perfect algorithms (though there is no such). I do high-load DNA processing and know my field and my algorithms quite well."
If so, why not do the following:
1. Get a timer and set it to 0.
2. Compile your high-load program with (x < 0).
3. Start your program and the timer.
4. When the program ends, look at the timer and remember result1.
5. Same as step 1.
6. Compile your high-load program with (x == -1).
7. Same as step 3.
8. When the program ends, look at the timer and remember result2.
9. Compare result1 and result2.
You'll get the Answer.