I am currently writing some glsl like vector math classes in C++, and I just implemented an abs() function like this:
template <class T>
static inline T abs(T _a)
{
    return _a < 0 ? -_a : _a;
}
I compared its speed to the default C++ abs from math.h like this:
clock_t begin = clock();
for (int i = 0; i < 10000000; ++i)
{
    float a = abs(-1.25);
}
clock_t end = clock();
unsigned long time1 = (unsigned long)((float)(end - begin) / ((float)CLOCKS_PER_SEC / 1000.0));

begin = clock();
for (int i = 0; i < 10000000; ++i)
{
    float a = myMath::abs(-1.25);
}
end = clock();
unsigned long time2 = (unsigned long)((float)(end - begin) / ((float)CLOCKS_PER_SEC / 1000.0));

std::cout << time1 << std::endl;
std::cout << time2 << std::endl;
Now the default abs takes about 25 ms while mine takes 60. I guess there is some low-level optimisation going on. Does anybody know how the math.h abs works internally? The performance difference is nothing dramatic, but I am just curious!
Since they are the implementation, they are free to make as many assumptions as they want. They know the format of the double and can play tricks with that instead.
Likely (as in almost not even a question), your double is in the binary64 format. This means the sign has its own bit, and taking an absolute value is merely a matter of clearing that bit. For example, as a specialization, a compiler implementer may do the following:
template <>
double abs<double>(const double x)
{
    // Breaks strict aliasing, but the compiler writer knows the
    // behavior for the platform.
    std::uint64_t i = reinterpret_cast<const std::uint64_t&>(x);
    i &= 0x7FFFFFFFFFFFFFFFULL; // clear the sign bit
    return reinterpret_cast<const double&>(i);
}
This removes branching and may run faster.
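For reference, the same trick can be written without the aliasing violation in standard C++20 using std::bit_cast. This is a minimal sketch of my own, not the code any particular library actually ships:

#include <bit>      // std::bit_cast (C++20)
#include <cstdint>

double abs_bits(double x)
{
    std::uint64_t i = std::bit_cast<std::uint64_t>(x); // view the double's bits
    i &= 0x7FFFFFFFFFFFFFFFULL;                        // clear the sign bit
    return std::bit_cast<double>(i);                   // back to a double
}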
There are well-known tricks for computing the absolute value of a two's complement signed number. If the number is negative, flip all the bits and add 1, that is, xor with -1 and subtract -1. If it is positive, do nothing, that is, xor with 0 and subtract 0.
int my_abs(int x)
{
    // Assumes 32-bit int and an arithmetic right shift:
    // s is 0 when x is non-negative, -1 (all ones) when x is negative.
    int s = x >> 31;
    return (x ^ s) - s; // xor flips the bits when s == -1; subtracting -1 adds 1
}
What are your compiler and settings? I'm sure MS and GCC implement "intrinsic functions" for many math and string operations.
The following line:
printf("%.3f", abs(1.25));
falls into the following "fabs" code path (in msvcr90d.dll):
004113DE sub esp,8
004113E1 fld qword ptr [__real#3ff4000000000000 (415748h)]
004113E7 fstp qword ptr [esp]
004113EA call abs (4110FFh)
abs calls the C runtime 'fabs' implementation in MSVCR90D (rather large):
102F5730 mov edi,edi
102F5732 push ebp
102F5733 mov ebp,esp
102F5735 sub esp,14h
102F5738 fldz
102F573A fstp qword ptr [result]
102F573D push 0FFFFh
102F5742 push 133Fh
102F5747 call _ctrlfp (102F6140h)
102F574C add esp,8
102F574F mov dword ptr [savedcw],eax
102F5752 movzx eax,word ptr [ebp+0Eh]
102F5756 and eax,7FF0h
102F575B cmp eax,7FF0h
102F5760 jne fabs+0D2h (102F5802h)
102F5766 sub esp,8
102F5769 fld qword ptr [x]
102F576C fstp qword ptr [esp]
102F576F call _sptype (102F9710h)
102F5774 add esp,8
102F5777 mov dword ptr [ebp-14h],eax
102F577A cmp dword ptr [ebp-14h],1
102F577E je fabs+5Eh (102F578Eh)
102F5780 cmp dword ptr [ebp-14h],2
102F5784 je fabs+77h (102F57A7h)
102F5786 cmp dword ptr [ebp-14h],3
102F578A je fabs+8Fh (102F57BFh)
102F578C jmp fabs+0A8h (102F57D8h)
102F578E push 0FFFFh
102F5793 mov ecx,dword ptr [savedcw]
102F5796 push ecx
102F5797 call _ctrlfp (102F6140h)
102F579C add esp,8
102F579F fld qword ptr [x]
102F57A2 jmp fabs+0F8h (102F5828h)
102F57A7 push 0FFFFh
102F57AC mov edx,dword ptr [savedcw]
102F57AF push edx
102F57B0 call _ctrlfp (102F6140h)
102F57B5 add esp,8
102F57B8 fld qword ptr [x]
102F57BB fchs
102F57BD jmp fabs+0F8h (102F5828h)
102F57BF mov eax,dword ptr [savedcw]
102F57C2 push eax
102F57C3 sub esp,8
102F57C6 fld qword ptr [x]
102F57C9 fstp qword ptr [esp]
102F57CC push 15h
102F57CE call _handle_qnan1 (102F98C0h)
102F57D3 add esp,10h
102F57D6 jmp fabs+0F8h (102F5828h)
102F57D8 mov ecx,dword ptr [savedcw]
102F57DB push ecx
102F57DC fld qword ptr [x]
102F57DF fadd qword ptr [__real#3ff0000000000000 (1022CF68h)]
102F57E5 sub esp,8
102F57E8 fstp qword ptr [esp]
102F57EB sub esp,8
102F57EE fld qword ptr [x]
102F57F1 fstp qword ptr [esp]
102F57F4 push 15h
102F57F6 push 8
102F57F8 call _except1 (102F99B0h)
102F57FD add esp,1Ch
102F5800 jmp fabs+0F8h (102F5828h)
102F5802 mov edx,dword ptr [ebp+0Ch]
102F5805 and edx,7FFFFFFFh
102F580B mov dword ptr [ebp-0Ch],edx
102F580E mov eax,dword ptr [x]
102F5811 mov dword ptr [result],eax
102F5814 push 0FFFFh
102F5819 mov ecx,dword ptr [savedcw]
102F581C push ecx
102F581D call _ctrlfp (102F6140h)
102F5822 add esp,8
102F5825 fld qword ptr [result]
102F5828 mov esp,ebp
102F582A pop ebp
102F582B ret
In release mode, the FPU FABS instruction is used instead (it takes only one clock cycle on any FPU from the Pentium onwards); the disassembly output is:
00401006 fld qword ptr [__real#3ff4000000000000 (402100h)]
0040100C sub esp,8
0040100F fabs
00401011 fstp qword ptr [esp]
00401014 push offset string "%.3f" (4020F4h)
00401019 call dword ptr [__imp__printf (4020A0h)]
It probably just uses a bitmask to set the sign bit to 0.
There can be several things:
are you sure the first call uses std::abs? It could just as well use the integer abs from C (either call std::abs explicitly, or add using std::abs;)
the compiler might have an intrinsic implementation of some float functions (e.g. compiling them directly into FPU instructions)
However, I'm surprised the compiler doesn't eliminate the loop altogether: you don't do anything with any visible effect inside the loop, and at least in the case of abs, the compiler should know there are no side effects.
Your version of abs is inlined and can be computed once; the compiler can trivially see that the returned value isn't going to change, so it doesn't even need to call the function.
You really need to look at the generated assembly code (set a breakpoint and open the "large" debugger view; the disassembly will be on the bottom left, if memory serves), and then you can see what's going on.
You can find documentation on your processor online without too much trouble, it'll tell you what all of the instructions are so you can figure out what's happening. Alternatively, paste it here and we'll tell you. ;)
Probably the library version of abs is an intrinsic function, whose behavior is exactly known by the compiler, which can even compute the value at compile time (since in your case it's known) and optimize the call away. You should try your benchmark with a value known only at runtime (provided by the user, or obtained with rand() before the two loops).
If there's still a difference, it may be because the library abs is written directly in hand-forged assembly with magic tricks, so it could be a little faster than the generated one.
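A sketch of that benchmark with a runtime input, plus a volatile sink so the compiler cannot fold the result or delete the loop (the volatile sink and the timing arithmetic are my additions, not part of the original code):

#include <cmath>    // std::abs for floating point
#include <cstdlib>  // std::rand
#include <ctime>    // clock
#include <iostream>

int main()
{
    // A value the compiler cannot know at compile time.
    float v = static_cast<float>(std::rand()) - RAND_MAX / 2.0f;
    volatile float sink = 0.0f; // keeps the loop from being optimized away

    clock_t begin = clock();
    for (int i = 0; i < 10000000; ++i)
        sink = sink + std::abs(v);
    clock_t end = clock();

    std::cout << (end - begin) * 1000.0 / CLOCKS_PER_SEC << " ms" << std::endl;
}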
The library abs function operates on integers while you are obviously testing floats. This means that a call to abs with a float argument involves a conversion from float to int (which may be a no-op here, since you are using a constant that the compiler can convert at compile time), then an INTEGER abs operation, and then a conversion from int back to float. Your templated function operates on floats throughout, and this is probably what makes the difference.
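A minimal sketch of my own showing which overload gets picked:

#include <cstdlib>   // int abs(int)
#include <cmath>     // floating-point std::abs overloads, std::fabs
#include <iostream>

int main()
{
    float f = -1.25f;
    int   i = abs(-2);       // integer abs from <cstdlib>
    float g = std::abs(f);   // float overload from <cmath>, no conversion
    float h = std::fabs(f);  // C-style fabs, also floating point
    std::cout << i << ' ' << g << ' ' << h << std::endl; // prints: 2 1.25 1.25
}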
Related
I am trying to figure out which implementation has an edge over the other when finding the maximum of two numbers. As an example, let's examine two implementations:
Implementation 1:
int findMax(int a, int b)
{
    return (a > b) ? a : b;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov eax, DWORD PTR [rbp-4]
cmp eax, DWORD PTR [rbp-8]
jle .L2
mov eax, DWORD PTR [rbp-4]
jmp .L4
.L2:
mov eax, DWORD PTR [rbp-8]
.L4:
pop rbp
ret
Implementation 2:
int findMax(int a, int b)
{
    int diff, s, max;
    diff = a - b;
    s = (diff >> 31) & 1;
    max = a - (s * diff);
    return max;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov eax, DWORD PTR [rbp-20]
sub eax, DWORD PTR [rbp-24]
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
shr eax, 31
mov DWORD PTR [rbp-8], eax
mov eax, DWORD PTR [rbp-8]
imul eax, DWORD PTR [rbp-4]
mov edx, eax
mov eax, DWORD PTR [rbp-20]
sub eax, edx
mov DWORD PTR [rbp-12], eax
mov eax, DWORD PTR [rbp-12]
pop rbp
ret
The second one produces more assembly instructions, but the first one has a conditional jump. I am just trying to understand whether both are equally good.
First you need to turn on compiler optimizations (I used -O2 for the following). And you should compare to std::max. Then this:
#include <algorithm>

int findMax(int a, int b)
{
    return (a > b) ? a : b;
}

int findMax2(int a, int b)
{
    int diff, s, max;
    diff = a - b;
    s = (diff >> 31) & 1;
    max = a - (s * diff);
    return max;
}

int findMax3(int a, int b)
{
    return std::max(a, b);
}
results in:
findMax(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
findMax2(int, int):
mov ecx, edi
mov eax, edi
sub ecx, esi
mov edx, ecx
shr edx, 31
imul edx, ecx
sub eax, edx
ret
findMax3(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
Your first version results in assembly identical to std::max, while your second variant does more work. Also, when trying to optimize, you need to specify what you are optimizing for. There are several options that typically require a trade-off: runtime, memory usage, size of the executable, readability of the code, etc. Typically you cannot get all of them at once.
When in doubt, do not reinvent the wheel; use the existing, already optimized std::max. And do not forget that the code you write is not a list of instructions for your CPU; rather, it is a high-level, abstract description of what the program should do. It's the compiler's job to figure out how that can best be achieved.
Last but not least, your second variant is actually broken. See example here compiled with -O2 -fsanitize=signed-integer-overflow, results in:
/app/example.cpp:13:10: runtime error: signed integer overflow: -2147483648 - 2147483647 cannot be represented in type 'int'
You should favor correctness over speed. The fastest code is not worth a thing when it is wrong. Because of that, readability is next on the list: code that is difficult to read and understand is also difficult to prove correct. I was only able to spot the problem in your code with the help of the compiler, while std::max(a, b) is unlikely to cause undefined behavior (and even if it does, at least it isn't your fault ;).
For two ints, you can compute max(a, b) without branching using a technique you probably learnt at school:
a ^ ((a ^ b) & -(a < b));
But no sane person would write this in their code. Always use std::max and trust the compiler to pick the best way. You may well find it adopts the above for int arguments with optimisations set appropriately, although I conjecture that a compare and jump is probably the best way on the whole, even at the expense of a pipeline flush.
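Purely for illustration, here is the school trick wrapped into a function (the name is mine; it relies on two's complement and on (a < b) converting to exactly 0 or 1):

int branchless_max(int a, int b)
{
    // If a < b, the mask is all ones and the xor swaps in b; otherwise
    // the mask is zero and a comes through unchanged.
    return a ^ ((a ^ b) & -(a < b));
}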
Using std::max gives the compiler the best optimisation hint.
Implementation 1 performs well on a CISC CPU like a modern x64 AMD/Intel CPU.
Implementation 2 performs well on RISC-like hardware such as a GPU from NVIDIA or AMD.
The term "performs well" is only significant in a tight loop.
I'm having a weird error. I have one module compiled by one compiler (MSVC in this case) that calls code loaded from another module compiled by a separate compiler (TCC).
The TCC code provides a callback function that both modules define like this:
typedef float (*ScaleFunc)(float value, float _min, float _max);
The MSVC code calls the code like this:
finalValue = extScale(val, _min, _max);
000007FEECAFCF52 mov rax,qword ptr [this]
000007FEECAFCF5A movss xmm2,dword ptr [rax+0D0h]
000007FEECAFCF62 mov rax,qword ptr [this]
000007FEECAFCF6A movss xmm1,dword ptr [rax+0CCh]
000007FEECAFCF72 movss xmm0,dword ptr [val]
000007FEECAFCF78 mov rax,qword ptr [this]
000007FEECAFCF80 call qword ptr [rax+0B8h]
000007FEECAFCF86 movss dword ptr [finalValue],xmm0
and the function compiled by TCC looks like this:
float linear_scale(float value, float _min, float _max)
{
    return value * (_max - _min) + _min;
}
0000000000503DC4 push rbp
0000000000503DC5 mov rbp,rsp
0000000000503DC8 sub rsp,0
0000000000503DCF mov qword ptr [rbp+10h],rcx
0000000000503DD3 mov qword ptr [rbp+18h],rdx
0000000000503DD7 mov qword ptr [rbp+20h],r8
0000000000503DDB movd xmm0,dword ptr [rbp+20h]
0000000000503DE0 subss xmm0,dword ptr [rbp+18h]
0000000000503DE5 movq xmm1,xmm0
0000000000503DE9 movd xmm0,dword ptr [rbp+10h]
0000000000503DEE mulss xmm0,xmm1
0000000000503DF2 addss xmm0,dword ptr [rbp+18h]
0000000000503DF7 jmp 0000000000503DFC
0000000000503DFC leave
0000000000503DFD ret
It seems that TCC expects the arguments in the integer registers rcx, rdx and r8 (as the disassembly above shows), while MSVC passes them in the SSE registers. I thought that x64 (on Windows) defined one common calling convention? What exactly is going on here, and how can I enforce the same model on both sides?
The same code works correctly in 32-bit mode. Weirdly enough, on OS X (where the other code is compiled by LLVM) it works in both modes (32- and 64-bit). I'll see if I can fetch some assembly from there later.
---- edit ----
I have created a working solution. It is, however, without doubt the dirtiest hack I've ever made (barring questionable inline assembly; unfortunately that isn't available on 64-bit MSVC :)).
// Passes the first three floating-point arguments in the integer
// registers rcx, rdx and r8, where the TCC-compiled callee expects them.
template <typename sseType>
sseType TCCAssemblyHelper(ScaleFunc cb, sseType val, sseType _min, sseType _max)
{
    sseType xmm0(val), xmm1(_min), xmm2(_max);
    long long rcx, rdx, r8;
    rcx = *(long long*)&xmm0;
    rdx = *(long long*)&xmm1;
    r8  = *(long long*)&xmm2;
    typedef float (*interMedFunc)(long long rcx, long long rdx, long long r8);
    interMedFunc helperFunc = reinterpret_cast<interMedFunc>(cb);
    return helperFunc(rcx, rdx, r8);
}
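A slightly safer variant of the same hack, sketched against the question's ScaleFunc typedef: widening each float into a 64-bit integer via memcpy avoids reading 8 bytes from a 4-byte object (the names widen and callScale are my own inventions):

#include <cstring>  // std::memcpy

long long widen(float f)
{
    long long bits = 0;
    std::memcpy(&bits, &f, sizeof f); // the low 32 bits hold the float
    return bits;
}

float callScale(ScaleFunc cb, float val, float _min, float _max)
{
    // Reinterpret the callback as taking integer-register arguments,
    // matching what the TCC-compiled code actually reads.
    typedef float (*IntArgFunc)(long long, long long, long long);
    IntArgFunc f = reinterpret_cast<IntArgFunc>(cb);
    return f(widen(val), widen(_min), widen(_max));
}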
How are types like int64_t implemented at the lowest, i.e. assembly, level? I'm using a 32-bit machine and can still use int64_t, for example. My initial assumption is that the 64 bits are simply emulated, and thus there must be quite some overhead for computations with these types compared to 32-bit data types on a 32-bit machine.
Thank you in advance, and regards.
You are right, when you compile code for 32 bit architectures, you have to simulate 64 bit operands and operations using 32 bit operands.
An 8-byte variable (uint64_t is just a typedef for unsigned long long) is stored as two 4-byte halves.
For addition (and subtraction), you first add the lower 4 bytes and then perform a second add with carry (or subtract with borrow) on the higher 4 bytes. Since the second add also adds the carry from the first, the result is correct, so the overhead for addition and subtraction is small.
For multiplication and division, however, things aren't that simple. Usually a routine is called to perform these operations, and the overhead is significantly bigger.
Let's take this simple C code:
int main() {
    long long a = 0x0102030405060708;
    long long b = 0xA1A2A3A4A5A6A7A8;
    long long c = 0xB1B2B3B4B5B6B7B8;

    c = a + b;
    c = a - b;
    c = a * b;
    c = a / b;

    return 0;
}
Analyzing the assembly generated by MSVC we can see:
2: long long a = 0x0102030405060708;
012D13DE mov dword ptr [a],5060708h
012D13E5 mov dword ptr [ebp-8],1020304h
3: long long b = 0xA1A2A3A4A5A6A7A8;
012D13EC mov dword ptr [b],0A5A6A7A8h
012D13F3 mov dword ptr [ebp-18h],0A1A2A3A4h
4: long long c = 0xB1B2B3B4B5B6B7B8;
012D13FA mov dword ptr [c],0B5B6B7B8h
012D1401 mov dword ptr [ebp-28h],0B1B2B3B4h
A 64-bit variable is split across two 32-bit locations.
6: c = a + b;
012D1408 mov eax,dword ptr [a]
012D140B add eax,dword ptr [b]
012D140E mov ecx,dword ptr [ebp-8]
012D1411 adc ecx,dword ptr [ebp-18h]
012D1414 mov dword ptr [c],eax
012D1417 mov dword ptr [ebp-28h],ecx
7: c = a - b;
012D141A mov eax,dword ptr [a]
012D141D sub eax,dword ptr [b]
012D1420 mov ecx,dword ptr [ebp-8]
012D1423 sbb ecx,dword ptr [ebp-18h]
012D1426 mov dword ptr [c],eax
012D1429 mov dword ptr [ebp-28h],ecx
The sum is performed with an add instruction on the lower 32 bits, followed by an adc (add with carry) on the higher 32 bits. Subtraction is similar: the second operation is an sbb (subtract with borrow).
8: c = a * b;
012D142C mov eax,dword ptr [ebp-18h]
012D142F push eax
012D1430 mov ecx,dword ptr [b]
012D1433 push ecx
012D1434 mov edx,dword ptr [ebp-8]
012D1437 push edx
012D1438 mov eax,dword ptr [a]
012D143B push eax
012D143C call __allmul (012D105Ah)
012D1441 mov dword ptr [c],eax
012D1444 mov dword ptr [ebp-28h],edx
9: c = a / b;
012D1447 mov eax,dword ptr [ebp-18h]
012D144A push eax
012D144B mov ecx,dword ptr [b]
012D144E push ecx
012D144F mov edx,dword ptr [ebp-8]
012D1452 push edx
012D1453 mov eax,dword ptr [a]
012D1456 push eax
012D1457 call __alldiv (012D1078h)
012D145C mov dword ptr [c],eax
012D145F mov dword ptr [ebp-28h],edx
The product and division are performed by calling special routines.
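What the add/adc pair does can be mirrored in source code; this is my own illustration of the idea, not code the compiler emits:

// 64-bit addition built from 32-bit halves, mirroring add + adc above.
typedef struct { unsigned int lo, hi; } u64_pair;

u64_pair add64(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo); // (r.lo < a.lo) is the carry bit
    return r;
}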
I'm writing a program which performs millions of modular additions. For more efficiency, I started thinking about how machine-level instructions can be used to implement modular additions.
Let w be the word size of the machine (typically, 32 or 64 bits). If one takes the modulus to be 2^w, then the modular addition can be performed very fast: It suffices to simply add the addends, and discard the carry.
I tested my idea using the following C code:
#include <stdio.h>
#include <time.h>
int main()
{
unsigned int x, y, z, i;
clock_t t1, t2;
x = y = 0x90000000;
t1 = clock();
for(i = 0; i <20000000 ; i++)
z = (x + y) % 0x100000000ULL;
t2 = clock();
printf("%x\n", z);
printf("%u\n", (int)(t2-t1));
return 0;
}
Compiling using GCC with the following options (I used -O0 to prevent GCC from optimizing the loop away):
-S -masm=intel -O0
The relevant part of the resulting assembly code is:
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add eax, edx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
As is evident, no modular arithmetic whatsoever is involved.
Now, if we change the modular addition line of the C code to:
z = (x + y) % 0x0F0000000ULL;
The assembly code changes to (only the relevant part is shown):
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add edx, eax
cmp edx, -268435456
setae al
movzx eax, al
mov DWORD PTR [esp+44], eax
mov ecx, DWORD PTR [esp+44]
mov eax, 0
sub eax, ecx
sal eax, 28
mov ecx, edx
sub ecx, eax
mov eax, ecx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
Obviously, a great number of instructions were added between the two calls to _clock.
Considering the increased number of assembly instructions, I expected the performance gain from a proper choice of modulus to be at least 100%. However, running the output, I noticed that the speed increased by only about 10%. I suspected the OS was using the multi-core CPU to run the code in parallel, but even setting the CPU affinity of the process to 1 didn't change anything.
Could you please provide me with an explanation?
Edit: Running the example with VC++ 2010, I got what I expected: the second code is around 12 times slower than the first example!
Art nailed it.
For the power-of-2 modulus, the code generated for the computation with -O0 and -O3 is identical; the difference is in the loop-control code, and the running time differs by a factor of 3.
For the other modulus, the difference in the loop-control code is the same, but the code for the computation is not quite identical (the optimised code looks like it should be a bit faster, but I don't know enough about assembly or my processor to be sure). The difference in running time between unoptimised and optimised code is about 2×.
Running times for both moduli are similar with unoptimised code: about the same as the running time without any modulus, and about the same as the running time of the executable obtained by removing the computation from the generated assembly.
So the running time is completely dominated by the loop-control code:
mov DWORD PTR [esp+40], 0
jmp L2
L3:
# snip
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
With optimisations turned on, the loop counter is kept in a register and decremented, and the jump instruction becomes a jne. That loop control is so much faster that the modulus computation now takes a significant fraction of the running time; removing the computation from the generated assembly reduces the running time by a factor of 3 and 2, respectively.
So when compiled with -O0, you're not measuring the speed of the computation but the speed of the loop-control code, hence the small difference. With optimisations, you are measuring both the computation and the loop control, and the difference in the computation's speed shows clearly.
The difference between the two boils down to the fact that division by a power of 2 can easily be transformed into a logic instruction:
a / n, where n is a power of two, is equivalent to a >> log2(n)
For the modulo it's the same:
a % n can be rendered as a & (n - 1)
But in your case it goes even further than that: your value 0x100000000ULL is 2^32, which means that any unsigned 32-bit variable automatically holds a value modulo 2^32. The compiler was smart enough to remove the operation entirely, because it is unnecessary on 32-bit variables; the ULL suffix doesn't change that fact.
For the value 0x0F0000000, which fits in a 32-bit variable, the compiler cannot elide the operation. It uses a transformation that is still faster than an actual division.
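Minimal examples of both effects, written by me to illustrate the point:

#include <cstdio>

int main()
{
    // Unsigned 32-bit arithmetic is already modulo 2^32:
    unsigned int x = 0x90000000u, y = 0x90000000u;
    unsigned int z = x + y; // wraps to 0x20000000; no % instruction emitted

    // Power-of-two division and modulo reduce to shift and mask:
    unsigned int a = 12345u;
    std::printf("%x %u %u\n", z, a / 16, a % 16); // a / 16 -> a >> 4, a % 16 -> a & 15
}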
I'm currently trying to get used to assembler, so I wrote a for loop in C++ and then looked at it in the disassembly. I was wondering if anyone could explain what each step does and/or how to improve the loop manually.
for (int i = 0; i < length; i++){
013A17AE mov dword ptr [i],0
013A17B5 jmp encrypt_chars+30h (13A17C0h)
013A17B7 mov eax,dword ptr [i]
013A17BA add eax,1
013A17BD mov dword ptr [i],eax
013A17C0 mov eax,dword ptr [i]
013A17C3 cmp eax,dword ptr [length]
013A17C6 jge encrypt_chars+6Bh (13A17FBh)
temp_char = OChars [i]; // get next char from original string
013A17C8 mov eax,dword ptr [i]
013A17CB mov cl,byte ptr OChars (13AB138h)[eax]
013A17D1 mov byte ptr [temp_char],cl
Thanks in advance.
First, I'd note that what you've posted seems to contain only part of the loop body. Second, it looks like you compiled with all optimization turned off -- when/if you turn on optimization, don't be surprised if the result looks rather different.
That said, let's look at the code line-by-line:
013A17AE mov dword ptr [i],0
This is basically just i=0.
013A17B5 jmp encrypt_chars+30h (13A17C0h)
This is going to the beginning of the loop. Although it's common to put the test at the top of a loop in most higher level languages, that's not always the case in assembly language.
013A17B7 mov eax,dword ptr [i]
013A17BA add eax,1
013A17BD mov dword ptr [i],eax
This is i++ in (extremely sub-optimal) assembly language. It's retrieving the current value of i, adding one to it, then storing the result back into i.
013A17C0 mov eax,dword ptr [i]
013A17C3 cmp eax,dword ptr [length]
013A17C6 jge encrypt_chars+6Bh (13A17FBh)
This is basically if (i >= length) /* skip forward to some code you haven't shown */. It retrieves the value of i and compares it to the value of length, then jumps somewhere if i is greater than or equal to length.
If you were writing this in assembly language by hand, you'd normally use something like xor eax, eax (or sub eax, eax) to zero a register. In most cases, you'd start from the maximum and count down to zero if possible (avoids a comparison in the loop). You certainly wouldn't store a value into a variable, then immediately retrieve it back out (in fairness, a compiler probably won't do that either, if you turn on optimization).
Applying that, and moving the "variables" into registers, we'd end up with something on this general order:
mov ecx, length
loop_top:
; stuff that wasn't pasted goes here
dec ecx
jnz loop_top
I'll try to interpret this in plain english:
013A17AE mov dword ptr [i],0 ; Move into i, 0
013A17B5 jmp encrypt_chars+30h (13A17C0h) ; Jump to check
013A17B7 mov eax,dword ptr [i] ; Load i into the accumulator (register eax)
013A17BA add eax,1 ; Increment the accumulator
013A17BD mov dword ptr [i],eax ; and put that in it, effectively adding
; 1 to i.
check:
013A17C0 mov eax,dword ptr [i] ; Move i into the accumulator
013A17C3 cmp eax,dword ptr [length] ; Compare it to the value in 'length',
; setting flags
013A17C6 jge encrypt_chars+6Bh (13A17FBh) ; Jump if it's greater or equal. This
; address is not in your code snippet
The compiler prefers EAX for arithmetic. Each register (in the past, at least; I don't know if this is still current) has some type of operation that it is faster at doing.
Here's the part that should be more optimized:
(note: your compiler SHOULD do this, so either you have optimizations turned off, or something in the loop body is preventing this optimization)
mov eax,dword ptr [i] ; Go get "i" from memory, put it in register EAX
add eax,1 ; Add one to register EAX
mov dword ptr [i],eax ; Put register EAX back in memory "i". (now one bigger)
mov eax,dword ptr [i] ; Go get "i" from memory, put it in EAX again.
See how often you're moving values back-n-forth from memory to EAX?
You should be able to load "i" into EAX once at the beginning of the loop, run the full loop directly out of EAX, and put the finished value back into "i" after it's all done.
(unless something else in your code prevents this)
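In source terms, that is what the optimizer does once the loop body permits it. A hypothetical reconstruction (only OChars and temp_char come from the question; the XOR body is invented, since the full loop wasn't posted):

void encrypt_chars(char* OChars, int length)
{
    for (int i = 0; i < length; ++i) // with optimizations on, i stays in a register
    {
        char temp_char = OChars[i];   // get next char from original string
        OChars[i] = temp_char ^ 0x5A; // invented body: toy XOR "encryption"
    }
}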
Anyway, this code comes from a DEBUG build. It is possible to optimize it by hand, but the MS compiler produces very good code for such simple cases.
There is no point in doing it manually; just rebuild in release mode and read the listing to learn how it's done.