How do I speed up a for loop with unrelated data? - c++

A very simple example of passing an array of integers to a for loop is shown below. If those integers are unrelated to each other, how can I make it so that a "for loop" handles all of them at the same time?
int waffles[3] = { 0 };
for (int i = 0; i < 3; i++) {
    waffles[i] = i;
}
What I get
clock 1: waffles[0] = 0;
clock 2: waffles[1] = 1;
clock 3: waffles[2] = 2;
What I want
clock 1: waffles[0] = 0, waffles[1] = 1, waffles[2] = 2

This can actually be done using SIMD instructions such as the AVX instructions, although it is not trivial to implement. Before going down that road, make 100% sure you are actually bottlenecked by a specific loop and really NEED to improve performance there.
This might help https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/
(I know this is not a full answer, but I can't comment yet and it might help anyway)
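To give a flavour of what the intrinsics route looks like, here is a minimal SSE2 sketch for the toy example above (illustrative only; padding the array to four elements and using _mm_set_epi32/_mm_store_si128 are my own choices, not something the three-element loop actually needs):

#include <emmintrin.h>   // SSE2

int main() {
    alignas(16) int waffles[4] = { 0 };     // padded to 4 ints so one 128-bit store covers it
    __m128i v = _mm_set_epi32(3, 2, 1, 0);  // the values {0, 1, 2, 3} in a single register
    _mm_store_si128((__m128i*)waffles, v);  // all four elements written by one instruction
    return 0;
}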

As François Andrieux's comment points out:
The compiler will very likely unroll that loop to the most efficient form for the targeted platform.
See how this code compiles in Godbolt's Compiler Explorer here.
Clang puts 0 and 1 using the same instruction:
movabs rax, 4294967296
mov qword ptr [rsp + 12], rax
mov dword ptr [rsp + 20], 2
gcc puts 1 and 2 using the same instruction:
mov DWORD PTR [rsp], 0
mov QWORD PTR [rsp+4], rax
A larger array would result in vectorized instructions that store even more data at once (see here).
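As a hedged illustration of that last point, bumping the toy loop up to 16 elements is usually enough for gcc and clang at -O3 to emit a handful of SIMD stores of precomputed constants instead of scalar moves (the function wrapper here is only for illustration):

void fill(int (&waffles)[16]) {
    for (int i = 0; i < 16; i++) {
        waffles[i] = i;   // at -O3 this typically becomes a few 128-bit or wider vector stores
    }
}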

Related

Translating C++ x86 Inline assembly code to C++

I've been struggling trying to convert this assembly code to C++ code.
It's a function from an old game that takes pixel data Stmp, and I believe it places it at the destination void* dest.
void Function(int x, int y, int yl, void* Stmp, void* dest)
{
    unsigned long size = 1280 * 2;
    unsigned long j = yl;
    void* Dtmp = (void*)((char*)dest + y * size + (x * 2));
    _asm
    {
        push es;
        push ds;
        pop es;
        mov edx,Dtmp;
        mov esi,Stmp;
        mov ebx,j;
        xor eax,eax;
        xor ecx,ecx;
    loop_1:
        or bx,bx;
        jz exit_1;
        mov edi,edx;
    loop_2:
        cmp word ptr[esi],0xffff;
        jz exit_2;
        mov ax,[esi];
        add edi,eax;
        mov cx,[esi+2];
        add esi,4;
        shr ecx,2;
        jnc Next2;
        movsw;
    Next2:
        rep movsd;
        jmp loop_2;
    exit_2:
        add esi,2;
        add edx,size;
        dec bx;
        jmp loop_1;
    exit_1:
        pop es;
    };
}
That's as far as I've gotten (not sure if it's even correct):
while (j > 0)
{
    if (*stmp != 0xffff)
    {
    }
    ++stmp;
    dtmp += size;
    --j;
}
Any help is greatly appreciated. Thank you.
It saves / restores ES around setting it equal to DS so rep movsd will use the same addresses for load and store. That instruction is basically memcpy(edi, esi, ecx) but incrementing the pointers in EDI and ESI (by 4 * ecx). https://www.felixcloutier.com/x86/movs:movsb:movsw:movsd:movsq
In a flat memory model, you can totally ignore that. This code looks like it might have been written to run in 16-bit unreal mode, or possibly even real mode, hence the use of 16-bit registers all over the place.
Looks like it's loading some kind of records that tell it how many bytes to copy, and reading until the end of the record, at which point it looks for the next record there. There's an outer loop around that, looping through records.
The records look like this I think:
struct sprite_line {
    uint16_t skip_dstbytes, src_bytes;
    uint16_t src_data[]; // flexible array member, actual size unlimited but assumed to be a multiple of 2.
};
The inner loop is this:
;; char *dstp; // in EDI
;; struct spriteline *p // in ESI
loop_2:
cmp word ptr[esi],0xffff ; while( p->skip_dstbytes != (uint16_t)-1 ) {
jz exit_2;
mov ax,[esi]; ; EAX was xor-zeroed earlier; some old CPUs maybe had slow movzx loads
add edi,eax; ; dstp += p->skip_dstbytes;
mov cx,[esi+2]; ; bytelen = p->src_bytes;
add esi,4; ; p->data
shr ecx,2; ; length in dwords = bytelen >> 2
jnc Next2;
movsw; ; one 16-bit (word) copy if bytelen >> 1 is odd, i.e. if last bit shifted out was a 1.
; The first bit shifted out isn't checked, so size is assumed to be a multiple of 2.
Next2:
rep movsd; ; copy in 4-byte chunks
Old CPUs (before IvyBridge) had rep movsd faster than rep movsb, otherwise this code could just have done that.
or bx,bx;
jz exit_1;
That's an obsolete idiom that comes from 8080 for test bx,bx: it sets FLAGS according to BX, and the jz then jumps if BX was zero. So it's a while( bx != 0 ) {} loop, with dec bx in it. It's an inefficient way to write the loop; a compiler would put a dec/jnz .top_of_loop at the bottom, with a test once outside the loop in case it needs to run zero times. See Why are loops always compiled into "do...while" style (tail jump)?
Some people would say that's what a while loop looks like in asm, if they're picturing totally naive translation from C to asm.
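Putting that together, here is a hedged C++ sketch of what the whole routine appears to do, assuming a flat memory model and the record layout guessed above (the local names and the use of memcpy are mine; treat it as a reading aid, not a drop-in replacement):

#include <cstdint>
#include <cstring>

void Function(int x, int y, int yl, void* Stmp, void* dest)
{
    const unsigned long size = 1280 * 2;                 // destination pitch in bytes
    char*       dst_row = (char*)dest + y * size + x * 2;
    const char* src     = (const char*)Stmp;

    for (uint16_t j = (uint16_t)yl; j != 0; --j) {       // outer loop: the asm only uses BX, i.e. 16 bits of j
        char* dstp = dst_row;                            // mov edi,edx
        while (*(const uint16_t*)src != 0xffff) {        // 0xffff terminates this line's records
            uint16_t skip   = ((const uint16_t*)src)[0]; // skip_dstbytes
            uint16_t nbytes = ((const uint16_t*)src)[1]; // src_bytes
            src  += 4;                                   // add esi,4: past the record header
            dstp += skip;                                // add edi,eax
            std::memcpy(dstp, src, nbytes);              // movsw + rep movsd (nbytes assumed even)
            dstp += nbytes;
            src  += nbytes;
        }
        src     += 2;                                    // add esi,2: skip the 0xffff terminator
        dst_row += size;                                 // add edx,size: next destination row
    }
}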

Optimizing the backward solve for a sparse lower triangular linear system

I have the compressed sparse column (csc) representation of the n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for b in
(A + I)' * x = b
This is the routine I have for computing this:
void backsolve(const int* __restrict__ Lp,
               const int* __restrict__ Li,
               const double* __restrict__ Lx,
               const int n,
               double* __restrict__ x) {
    for (int i = n - 1; i >= 0; --i) {
        for (int j = Lp[i]; j < Lp[i + 1]; ++j) {
            x[i] -= Lx[j] * x[Li[j]];
        }
    }
}
Thus, b is passed in via the argument x and is overwritten by the solution. Lp, Li, and Lx are respectively the column pointers, row indices, and nonzero values of the standard CSC representation of sparse matrices. This function is the top hotspot in the program, with the line
x[i] -= Lx[j] * x[Li[j]];
being the bulk of the time spent. Compiling with gcc-8.3 -O3 -mfma -mavx -mavx512f gives
backsolve(int const*, int const*, double const*, int, double*):
lea eax, [rcx-1]
movsx r11, eax
lea r9, [r8+r11*8]
test eax, eax
js .L9
.L5:
movsx rax, DWORD PTR [rdi+r11*4]
mov r10d, DWORD PTR [rdi+4+r11*4]
cmp eax, r10d
jge .L6
vmovsd xmm0, QWORD PTR [r9]
.L7:
movsx rcx, DWORD PTR [rsi+rax*4]
vmovsd xmm1, QWORD PTR [rdx+rax*8]
add rax, 1
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r9], xmm0
cmp r10d, eax
jg .L7
.L6:
sub r11, 1
sub r9, 8
test r11d, r11d
jns .L5
ret
.L9:
ret
According to vtune,
vmovsd QWORD PTR [r9], xmm0
is the slowest part. I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc, but nothing has worked.
Processor: Xeon Skylake
Question: What can I do to optimize this function?
This should depend quite a bit on the exact sparsity pattern of the matrix and the platform being used. I tested a few things with gcc 8.3.0 and compiler flags -O3 -march=native (which is -march=skylake on my CPU) on the lower triangle of this matrix of dimension 3006 with 19554 nonzero entries. Hopefully this is somewhat close to your setup, but in any case I hope these can give you an idea of where to start.
For timing I used google/benchmark with this source file. It defines benchBacksolveBaseline which benchmarks the implementation given in the question and benchBacksolveOptimized which benchmarks the proposed "optimized" implementations. There is also benchFillRhs which separately benchmarks the function that is used in both to generate some not completely trivial values for the right hand side. To get the time of the "pure" backsolves, the time that benchFillRhs takes should be subtracted.
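For orientation, a stripped-down sketch of how such a benchmark is wired up with google/benchmark is shown below (the linked source file is the real harness; the tiny 3x3 placeholder matrix and the inline right-hand-side reset stand in for the real test matrix and benchFillRhs):

#include <benchmark/benchmark.h>
#include <vector>

// backsolve() exactly as in the question
static void backsolve(const int* __restrict__ Lp, const int* __restrict__ Li,
                      const double* __restrict__ Lx, const int n,
                      double* __restrict__ x) {
    for (int i = n - 1; i >= 0; --i)
        for (int j = Lp[i]; j < Lp[i + 1]; ++j)
            x[i] -= Lx[j] * x[Li[j]];
}

static void benchBacksolveBaseline(benchmark::State& state) {
    // placeholder: a 3x3 strictly lower triangular matrix in CSC form
    std::vector<int>    Lp{0, 2, 3, 3};          // column pointers
    std::vector<int>    Li{1, 2, 2};             // row indices
    std::vector<double> Lx{0.5, 0.25, 0.125};    // values
    std::vector<double> x(3);
    for (auto _ : state) {
        x = {1.0, 2.0, 3.0};                     // stands in for benchFillRhs
        backsolve(Lp.data(), Li.data(), Lx.data(), 3, x.data());
        benchmark::DoNotOptimize(x.data());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(benchBacksolveBaseline);
BENCHMARK_MAIN();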
1. Iterating strictly backwards
The outer loop in your implementation iterates through the columns backwards, while the inner loop iterates through the current column forwards. Seems like it would be more consistent to iterate through each column backwards as well:
for (int i = n - 1; i >= 0; --i) {
    for (int j = Lp[i + 1] - 1; j >= Lp[i]; --j) {
        x[i] -= Lx[j] * x[Li[j]];
    }
}
This barely changes the assembly (https://godbolt.org/z/CBZAT5), but the benchmark timings show a measurable improvement:
------------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                 2737 ns         2734 ns      5120000
benchBacksolveBaseline      17412 ns        17421 ns       829630
benchBacksolveOptimized     16046 ns        16040 ns       853333
I assume this is caused by somehow more predictable cache access, but I did not look into it much further.
2. Less loads/stores in inner loop
As A is lower triangular, we have i < Li[j]. Therefore we know that x[Li[j]] will not change due to the changes to x[i] in the inner loop. We can put this knowledge into our implementation by using a temporary variable:
for (int i = n - 1; i >= 0; --i) {
    double xi_temp = x[i];
    for (int j = Lp[i + 1] - 1; j >= Lp[i]; --j) {
        xi_temp -= Lx[j] * x[Li[j]];
    }
    x[i] = xi_temp;
}
This makes gcc 8.3.0 move the store to memory from inside the inner loop to directly after its end (https://godbolt.org/z/vM4gPD). The benchmark for the test matrix on my system shows a small improvement:
------------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                 2737 ns         2740 ns      5120000
benchBacksolveBaseline      17410 ns        17418 ns       814545
benchBacksolveOptimized     15155 ns        15147 ns       887129
3. Unroll the loop
While clang already starts unrolling the loop after the first suggested code change, gcc 8.3.0 still does not. So let's give that a try by additionally passing -funroll-loops.
------------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                 2733 ns         2734 ns      5120000
benchBacksolveBaseline      15079 ns        15081 ns       953191
benchBacksolveOptimized     14392 ns        14385 ns       963441
Note that the baseline also improves, as the loop in that implementation is also unrolled. Our optimized version also benefits a bit from loop unrolling, but maybe not as much as we may have liked. Looking into the generated assembly (https://godbolt.org/z/_LJC5f), it seems like gcc might have gone a little far with 8 unrolls. For my setup, I can in fact do a little better with just one simple manual unroll. So drop the flag -funroll-loops again and implement the unrolling with something like this:
for (int i = n - 1; i >= 0; --i) {
    const int col_begin = Lp[i];
    const int col_end = Lp[i + 1];
    const bool is_col_nnz_odd = (col_end - col_begin) & 1;
    double xi_temp = x[i];
    int j = col_end - 1;
    if (is_col_nnz_odd) {
        xi_temp -= Lx[j] * x[Li[j]];
        --j;
    }
    for (; j >= col_begin; j -= 2) {
        xi_temp -= Lx[j - 0] * x[Li[j - 0]] +
                   Lx[j - 1] * x[Li[j - 1]];
    }
    x[i] = xi_temp;
}
With that I measure:
------------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                 2728 ns         2729 ns      5090909
benchBacksolveBaseline      17451 ns        17449 ns       822018
benchBacksolveOptimized     13440 ns        13443 ns      1018182
Other algorithms
All of these versions still use the same simple implementation of the backward solve on the sparse matrix structure. Inherently, operating on sparse matrix structures like these can have significant problems with memory traffic. At least for matrix factorizations, there are more sophisticated methods that operate on dense submatrices assembled from the sparse structure; examples are supernodal and multifrontal methods. I am a bit fuzzy on this, but I think such methods also apply this idea to the lower triangular backward solves (for example for Cholesky-type factorizations), laying out the data so that dense matrix operations can be used. So it might be worth looking into those kinds of methods if you are not forced to stick to the simple approach that works on the sparse structure directly. See for example this survey by Davis.
You might shave a few cycles by using unsigned instead of int for the index types, which must be >= 0 anyway:
void backsolve(const unsigned* __restrict__ Lp,
               const unsigned* __restrict__ Li,
               const double* __restrict__ Lx,
               const unsigned n,
               double* __restrict__ x) {
    for (unsigned i = n; i-- > 0; ) {
        for (unsigned j = Lp[i]; j < Lp[i + 1]; ++j) {
            x[i] -= Lx[j] * x[Li[j]];
        }
    }
}
Compiling with Godbolt's compiler explorer shows slightly different code for the inner loop, potentially making better use of the CPU pipeline. I cannot test it, but you could try.
Here is the generated code for the inner loop:
.L8:
mov rax, rcx
.L5:
mov ecx, DWORD PTR [r10+rax*4]
vmovsd xmm1, QWORD PTR [r11+rax*8]
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
lea rcx, [rax+1]
vmovsd QWORD PTR [r9], xmm0
cmp rdi, rax
jne .L8

why is it faster to print number in binary using arithmetic instead of _bittest

The purpose of the next two code sections is to print a number in binary.
The first one does it with two instructions (_bittest), while the second does it with pure arithmetic instructions, which is three instructions.
The first code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = _bittest(&num, nBit);
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}
The inner loop is compiled to the instructions bt and setb.
The second code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        long curBit = 1;
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = (num & curBit) >> nBit;
            curBit <<= 1;
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}
The inner loop is compiled to and, add (as shift left) and sar.
The second code section runs three times faster than the first one.
Why do three CPU instructions run faster than two?
Not an answer (Bo already gave one), but the second inner-loop version can be simplified a bit:
long numCopy = num;
for (nBit = 0; nBit < 31; nBit++) {
    bits[nBit] = numCopy & 1;
    numCopy >>= 1;
}
It has a subtle difference (one instruction less) with gcc 7.2 targeting 32b.
(I'm assuming a 32b target, as you convert a long into a 32-bit array, which makes sense only on a 32b target ... and I assume x86, as it includes <windows.h>, so it's clearly for an obsolete OS target, although I think Windows now has a 64b version too? (I don't care.))
Answer:
Why do three CPU instructions run faster than two?
Because the instruction count only loosely correlates with performance (usually fewer is better). A modern x86 CPU is a much more complex machine: it translates the actual x86 instructions into micro-ops before execution, transforms them further by things like out-of-order execution and register renaming (to break false dependency chains), and then executes the resulting micro-ops, with different units of the CPU capable of executing only some of the micro-ops. In the ideal case you may get 2-3 micro-ops executed in parallel by 2-3 units in a single cycle, and in the worst case you may be executing a complete microcode loop implementing some complex x86 instruction, taking several cycles to finish and blocking most of the CPU units.
Another factor is the availability of data from memory, and memory writes: a single cache miss, when the data must be fetched from a higher-level cache or even memory itself, creates a stall of tens to hundreds of cycles. Having compact data structures that favour predictable access patterns and don't exhaust all the cache lines is paramount for exploiting maximum CPU performance.
If you are at the stage of asking "why are 3 instructions faster than 2 instructions", you can pretty much start with any x86 optimization article/book and keep reading for a few months or years; it's quite a complex topic.
You may want to check this answer https://gamedev.stackexchange.com/q/27196 for further reading...
I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code).
_bittest is slower because MSVC does a horrible job and keeps the value in memory and bt [mem], reg is much slower than bt reg,reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still constant!
I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say and yes it matters: bt [mem], reg has one per 3 cycle throughput on Ryzen, one per 5 cycle throughput on Haswell. And other perf characteristics differ...
(For just looking at the asm, it's usually a good idea to make a function with args to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know if anything modifies num before main runs, because it's not static.)
Your instruction-counting didn't include the whole loop so your counts are wrong, but more importantly you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)
This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell / Skylake:
for (nBit = 0; nBit < 31; nBit++) {
    bits[nBit] = _bittest(&num, nBit);
    //bits[nBit] = (bool)(num & (1UL << nBit)); // much more efficient
}
Asm output from MSVC CL19 -Ox on the Godbolt compiler explorer
$LL7@main:
bt DWORD PTR num, ebx ; 10 uops (microcoded), one per 5 cycle throughput
lea rcx, QWORD PTR [rcx+1] ; 1 uop
setb al ; 1 uop
inc ebx ; 1 uop
mov BYTE PTR [rcx-1], al ; 1 uop (micro-fused store-address and store-data)
cmp ebx, 31
jb SHORT $LL7@main ; 1 uop (macro-fused with cmp)
That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.
IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as the bt, or the data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?
Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. This loop would have been faster than your other loop with a 1 uop, 1c latency, 2-per-clock throughput bt r9d, ebx.
The inner loop is compiled to and, add (as shift left) and sar.
Huh? Those are the instructions MSVC associates with the curBit <<= 1; source line (even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line.)
But the whole loop is this clunky mess:
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++) {
    bits[nBit] = (num & curBit) >> nBit;
    curBit <<= 1;
}
$LL18@main: # MSVC CL19 -Ox
mov ecx, ebx ; 1 uop
lea r8, QWORD PTR [r8+1] ; 1 uop pointer-increment for bits
mov eax, r9d ; 1 uop. r9d holds num
inc ebx ; 1 uop
and eax, edx ; 1 uop
# MSVC says all the rest of these instructions are from curBit <<= 1; but they're obviously not.
add edx, edx ; 1 uop
sar eax, cl ; 3 uops (variable-count shifts suck)
mov BYTE PTR [r8-1], al ; 1 uop (micro-fused)
cmp ebx, 31
jb SHORT $LL18@main ; 1 uop (macro-fused with cmp)
So this is 11 fused-domain uops, and takes 2.75 clock cycles per iteration to issue from the front-end.
I don't see any loop-carried dep chains longer than that front-end bottleneck, so it probably runs about that fast.
Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift-count is needed in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that.)
There are major missed optimizations here (in the "fast" version), so you should probably write your source differently to hand-hold your compiler into making less bad code. It implements this fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg,reg / setc.
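As a hedged sketch of that hand-holding: testing the bit without ever taking num's address lets the compiler keep num in a register and use bt reg,reg / setcc (or a shift and mask) instead of the memory-destination bt. This is just the "much more efficient" variant from the comment above, wrapped in a function:

long num = 78002;

void bits_without_bittest(unsigned char bits[32])
{
    for (long nBit = 0; nBit < 31; nBit++)
    {
        // no &num anywhere, so nothing forces the value into memory
        bits[nBit] = (num & (1UL << nBit)) != 0;
    }
}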
How to do this fast in asm or with intrinsics
Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit for that element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 (i.e. add 0 or 1) to ASCII '0', conditionally turning it into '1'. A concrete sketch follows the links below.
The first step of this (getting a vector of 0/-1 from a bitmask) is covered in How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
Fastest way to unpack 32 bits to a 32 byte SIMD vector (has a 128b version). Without SSSE3 (pshufb), I think punpcklbw / punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.
is there an inverse instruction to the movemask instruction in intel avx2?
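Here is a concrete (hedged) sketch of that approach, using SSSE3 pshufb for the byte-replication step and producing 0/1 bytes like the question's bits[] array rather than ASCII; the comparison is done against the bit mask itself instead of the PANDN / compare-against-zero wording above, and the function name and constants are my own:

#include <tmmintrin.h>   // SSSE3
#include <cstdint>

// Expand the 32 bits of n into 32 bytes of 0/1 (bits[i] = bit i of n).
void bits_from_u32(uint32_t n, uint8_t bits[32])
{
    const __m128i vnum   = _mm_set1_epi32((int)n);
    const __m128i bitsel = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, (char)128,
                                         1, 2, 4, 8, 16, 32, 64, (char)128);
    // Replicate byte 0 of n into result bytes 0..7 and byte 1 into bytes 8..15 (bytes 2/3 for the high half).
    __m128i lo = _mm_shuffle_epi8(vnum, _mm_setr_epi8(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1));
    __m128i hi = _mm_shuffle_epi8(vnum, _mm_setr_epi8(2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3));
    // Isolate one bit per byte, compare against the mask to get 0 / -1, then mask down to 0 / 1.
    lo = _mm_and_si128(_mm_cmpeq_epi8(_mm_and_si128(lo, bitsel), bitsel), _mm_set1_epi8(1));
    hi = _mm_and_si128(_mm_cmpeq_epi8(_mm_and_si128(hi, bitsel), bitsel), _mm_set1_epi8(1));
    _mm_storeu_si128((__m128i*)(bits),      lo);   // bits[0..15]
    _mm_storeu_si128((__m128i*)(bits + 16), hi);   // bits[16..31]
}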
In scalar code, this is one way that runs at 1 bit->byte per clock. There are probably ways to do better without using SSE2 (storing multiple bytes at once to get around the 1 store per clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.
mov eax, [num]
lea rdi, [rsp + xxx] ; bits[]
.loop:
shr eax, 1 ; constant-count shift is efficient (1 uop). CF = last bit shifted out
setc [rdi] ; 2 uops, but just as efficient as setc reg / mov [mem], reg
shr eax, 1
setc [rdi+1]
add rdi, 2
cmp rdi, end_pointer ; compare against an end pointer instead of a separate counter.
jb .loop
Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.
The difference is that the code _bittest(&num, nBit); uses a pointer to num, which makes the compiler store it in memory. And the memory access makes the code a lot slower.
bits[nBit] = _bittest(&num, nBit);
00007FF6D25110A0 bt dword ptr [num (07FF6D2513034h)],ebx ; <-----
00007FF6D25110A7 lea rcx,[rcx+1]
00007FF6D25110AB setb al
00007FF6D25110AE inc ebx
00007FF6D25110B0 mov byte ptr [rcx-1],al
The other version stores all the variables in registers, and uses very fast register shifts and adds. No memory accesses.

C++ inline assembler. How to read a value with the two lines?

Taking an address and then reading the value at that address:
int m, n, k;
m = 7;
k = (int)&m;
n = *(int*)k;
The last line is compiled by Visual Studio 2013 to:
mov eax, k
mov eax, [eax]
mov n, eax
whereas the best variant would be:
mov eax,[k]
mov n,eax
But the code below is not working because [k] is interpreted as k:
__asm {
    mov eax,[k]
    mov n,eax
}
Why? How to fix it?
You are trying to do two indirections in one go. x86 doesn't support this.
When you are doing n = *(int *)k;, you are really reading the value of k and then reading the content of that memory location. Since the content of k is not in a register at this point, it needs to be loaded into a register, and then that register content stored in n.
If you had a PDP-11 or VAX processor, it does indeed have a mov *(k), n (with the operands in the opposite direction to Intel assembler, so it moves from k on the left to n on the right).
But x86, ARM, MIPS, 29K, 68000, and most other processors don't support this addressing mode.
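To answer the "how to fix it" part: since x86 has no memory-to-memory indirect move, you spell out the two loads yourself, which is exactly what the compiler generated for the C++ version (a sketch for MSVC-style 32-bit inline asm, reusing the question's variables):

int m, n, k;
m = 7;
k = (int)&m;
__asm {
    mov eax, k        ; load the value of k (which holds an address)
    mov eax, [eax]    ; dereference that address
    mov n, eax        ; n = *(int*)k
}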

Compile error with embedded assembler

I don't understand why this code
#include <iostream>
using namespace std;

int main() {
    int result = 0;
    _asm {
        mov eax, 3;
        MUL eax, 3;
        mov result, eax;
    }
    cout << result << endl;
    return 0;
}
shows the following error.
1>c:\users\david\documents\visual studio 2010\projects\assembler_instructions\assembler_instructions.cpp(11): error C2414: illegal number of operands
Everything seems fine, and yet why do I get this compiler error?
According to this page, the mul instruction only takes a single argument:
mul arg
This multiplies "arg" by the value of the corresponding byte length in the A register; see the table below:
operand size                      1 byte   2 bytes   4 bytes
other operand                     AL       AX        EAX
higher part of result stored in:  AH       DX        EDX
lower part of result stored in:   AL       AX        EAX
Thus, following the notes from Justin's link:
#include <iostream>

int main()
{
    int result = 0;
    _asm {
        mov eax, 3;
        mov ebx, 4;
        mul ebx;
        mov result, eax;
    }
    std::cout << result << std::endl;
    return 0;
}
Use:
imul eax, 3;
or:
imul eax, eax, 3;
That way you don't need to worry about the edx register being clobbered. It's a "signed integer multiply". You seem to have an int result, so it shouldn't matter whether you use mul or imul.
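For completeness, a hedged sketch of the original program rewritten with imul (MSVC-style inline asm, 32-bit build assumed; it should print 9):

#include <iostream>

int main()
{
    int result = 0;
    _asm {
        mov eax, 3;
        imul eax, eax, 3;   // eax = 3 * 3; edx is left untouched
        mov result, eax;
    }
    std::cout << result << std::endl;
    return 0;
}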
Sometimes I've gotten errors from not having the edx register zeroed when dividing or multiplying. The CPU was an Intel Core 2 Quad Q9550.
There are numbingly over-engineered but correct Intel instruction reference manuals you can read, though Intel broke its websites a while ago. You could try to find the same reference manuals on AMD's sites instead.
Update: I found the manual: http://www.intel.com/design/pentiumii/manuals/243191.htm
I don't know when they are going to break their sites again, so you really always need to search for it.
Update 2: ARGHL! Those are from 1999... well, most details are unfortunately the same.
You should download the Intel architecture manuals.
http://www.intel.com/products/processor/manuals/
For your purpose, volume 2 is going to help you the most.
As of access in July 2010, they are current.