C++ inline assembler. How to read a value with the two lines? - c++

Reading an address and reading a value by that address:
int m, n, k;
m = 7;
k = (int)&m;
n = *(int*)k;
Last line is compiled by Visual Studio 2013 to:
mov eax, k
mov eax, [eax]
mov n, eax
when the best variant is:
mov eax,[k]
mov n,eax
But the code below is not working because [k] is interpreted as k:
__asm {
mov eax,[k]
mov n,eax
}
Why? How to fix it?

You are trying to do two indirections in one go. x86 doesn't support this.
When you are doing n = *(int *)k;, you are really reading the value of k and then reading the content of that memory location. Since the content of k is not in a register at this point, it needs to be loaded into a register, and then that register content stored in n.
If you had a PDP-11 or VAX processor, it does indeed have a mov *(k), n (it has opposite direction to Intel assembler, so will move from k on the left to n on the right).
But x86, ARM, MIPS, 29K, 68000, and most other processors don't support this addressing moded

Related

Translating C++ x86 Inline assembly code to C++

I've been struggling trying to convert this assembly code to C++ code.
It's a function from an old game that takes pixel data Stmp, and I believe it places it to destination void* dest
void Function(int x, int y, int yl, void* Stmp, void* dest)
{
unsigned long size = 1280 * 2;
unsigned long j = yl;
void* Dtmp = (void*)((char*)dest + y * size + (x * 2));
_asm
{
push es;
push ds;
pop es;
mov edx,Dtmp;
mov esi,Stmp;
mov ebx,j;
xor eax,eax;
xor ecx,ecx;
loop_1:
or bx,bx;
jz exit_1;
mov edi,edx;
loop_2:
cmp word ptr[esi],0xffff;
jz exit_2;
mov ax,[esi];
add edi,eax;
mov cx,[esi+2];
add esi,4;
shr ecx,2;
jnc Next2;
movsw;
Next2:
rep movsd;
jmp loop_2;
exit_2:
add esi,2;
add edx,size;
dec bx;
jmp loop_1;
exit_1:
pop es;
};
}
That's where I've gotten as far to: (Not sure if it's even correct)
while (j > 0)
{
if (*stmp != 0xffff)
{
}
++stmp;
dtmp += size;
--j;
}
Any help is greatly appreciated. Thank you.
It saves / restores ES around setting it equal to DS so rep movsd will use the same addresses for load and store. That instruction is basically memcpy(edi, esi, ecx) but incrementing the pointers in EDI and ESI (by 4 * ecx). https://www.felixcloutier.com/x86/movs:movsb:movsw:movsd:movsq
In a flat memory model, you can totally ignore that. This code looks like it might have been written to run in 16-bit unreal mode, or possibly even real mode, hence the use of 16-bit registers all over the place.
Look like it's loading some kind of records that tell it how many bytes to copy, and reading until the end of the record, at which point it looks for the next record there. There's an outer loop around that, looping through records.
The records look like this I think:
struct sprite_line {
uint16_t skip_dstbytes, src_bytes;
uint16_t src_data[]; // flexible array member, actual size unlimited but assumed to be a multiple of 2.
};
The inner loop is this:
;; char *dstp; // in EDI
;; struct spriteline *p // in ESI
loop_2:
cmp word ptr[esi],0xffff ; while( p->skip_dstbytes != (uint16_t)-1 ) {
jz exit_2;
mov ax,[esi]; ; EAX was xor-zeroed earlier; some old CPUs maybe had slow movzx loads
add edi,eax; ; dstp += p->skip_dstbytes;
mov cx,[esi+2]; ; bytelen = p->src_len;
add esi,4; ; p->data
shr ecx,2; ; length in dwords = bytelen >> 2
jnc Next2;
movsw; ; one 16-bit (word) copy if bytelen >> 1 is odd, i.e. if last bit shifted out was a 1.
; The first bit shifted out isn't checked, so size is assumed to be a multiple of 2.
Next2:
rep movsd; ; copy in 4-byte chunks
Old CPUs (before IvyBridge) had rep movsd faster than rep movsb, otherwise this code could just have done that.
or bx,bx;
jz exit_1;
That's an obsolete idiom that comes from 8080 for test bx,bx / jnz, i.e. jump if BX was zero. So it's a while( bx != 0 ) {} loop. With dec bx in it. It's an inefficient way to write a while (--bx) loop; a compiler would put a dec/jnz .top_of_loop at the bottom, with a test once outside the loop in case it needs to run zero times. Why are loops always compiled into "do...while" style (tail jump)?
Some people would say that's what a while loop looks like in asm, if they're picturing totally naive translation from C to asm.

Is there a branchless method to quickly find the min/max of two double-precision floating-point values?

I have two doubles, a and b, which are both in [0,1]. I want the min/max of a and b without branching for performance reasons.
Given that a and b are both positive, and below 1, is there an efficient way of getting the min/max of the two? Ideally, I want no branching.
Yes, there is a way to calculate the maximum or minimum of two doubles without any branches. The C++ code to do so looks like this:
#include <algorithm>
double FindMinimum(double a, double b)
{
return std::min(a, b);
}
double FindMaximum(double a, double b)
{
return std::max(a, b);
}
I bet you've seen this before. Lest you don't believe that this is branchless, check out the disassembly:
FindMinimum(double, double):
minsd xmm1, xmm0
movapd xmm0, xmm1
ret
FindMaximum(double, double):
maxsd xmm1, xmm0
movapd xmm0, xmm1
ret
That's what you get from all popular compilers targeting x86. The SSE2 instruction set is used, specifically the minsd/maxsd instructions, which branchlessly evaluate the minimum/maximum value of two double-precision floating-point values.
All 64-bit x86 processors support SSE2; it is required by the AMD64 extensions. Even most x86 processors without 64-bit support SSE2. It was released in 2000. You'd have to go back a long way to find a processor that didn't support SSE2. But what about if you did? Well, even there, you get branchless code on most popular compilers:
FindMinimum(double, double):
fld QWORD PTR [esp + 12]
fld QWORD PTR [esp + 4]
fucomi st(1)
fcmovnbe st(0), st(1)
fstp st(1)
ret
FindMaximum(double, double):
fld QWORD PTR [esp + 4]
fld QWORD PTR [esp + 12]
fucomi st(1)
fxch st(1)
fcmovnbe st(0), st(1)
fstp st(1)
ret
The fucomi instruction performs a comparison, setting flags, and then the fcmovnbe instruction performs a conditional move, based on the value of those flags. This is all completely branchless, and relies on instructions introduced to the x86 ISA with the Pentium Pro back in 1995, supported on all x86 chips since the Pentium II.
The only compiler that won't generate branchless code here is MSVC, because it doesn't take advantage of the FCMOVxx instruction. Instead, you get:
double FindMinimum(double, double) PROC
fld QWORD PTR [a]
fld QWORD PTR [b]
fcom st(1) ; compare "b" to "a"
fnstsw ax ; transfer FPU status word to AX register
test ah, 5 ; check C0 and C2 flags
jp Alt
fstp st(1) ; return "b"
ret
Alt:
fstp st(0) ; return "a"
ret
double FindMinimum(double, double) ENDP
double FindMaximum(double, double) PROC
fld QWORD PTR [b]
fld QWORD PTR [a]
fcom st(1) ; compare "b" to "a"
fnstsw ax ; transfer FPU status word to AX register
test ah, 5 ; check C0 and C2 flags
jp Alt
fstp st(0) ; return "b"
ret
Alt:
fstp st(1) ; return "a"
ret
double FindMaximum(double, double) ENDP
Notice the branching JP instruction (jump if parity bit set). The FCOM instruction is used to do the comparison, which is part of the base x87 FPU instruction set. Unfortunately, this sets flags in the FPU status word, so in order to branch on those flags, they need to be extracted. That's the purpose of the FNSTSW instruction, which stores the x87 FPU status word to the general-purpose AX register (it could also store to memory, but…why?). The code then TESTs the appropriate bit, and branches accordingly to ensure that the correct value is returned. In addition to the branch, retrieving the FPU status word will also be relatively slow. This is why the Pentium Pro introduced the FCOM instructions.
However, it is unlikely that you would be able to improve upon the speed of any of this code by using bit-twiddling operations to determine min/max. There are two basic reasons:
The only compiler generating inefficient code is MSVC, and there's no good way to force it to generate the instructions you want it to. Although inline assembly is supported in MSVC for 32-bit x86 targets, it is a fool's errand when seeking performance improvements. I'll also quote myself:
Inline assembly disrupts the optimizer in rather significant ways, so unless you're writing significant swaths of code in inline assembly, there is unlikely to be a substantial net performance gain. Furthermore, Microsoft's inline assembly syntax is extremely limited. It trades flexibility for simplicity in a big way. In particular, there is no way to specify input values, so you're stuck loading the input from memory into a register, and the caller is forced to spill the input from a register to memory in preparation. This creates a phenomenon I like to call "a whole lotta shufflin' goin' on", or for short, "slow code". You don't drop to inline assembly in cases where slow code is acceptable. Thus, it is always preferable (at least on MSVC) to figure out how to write C/C++ source code that persuades the compiler to emit the object code you want. Even if you can only get close to the ideal output, that's still considerably better than the penalty you pay for using inline assembly.
In order to get access to the raw bits of a floating-point value, you'd have to do a domain transition, from floating-point to integer, and then back to floating-point. That's slow, especially without SSE2, because the only way to get a value from the x87 FPU to the general-purpose integer registers in the ALU is indirectly via memory.
If you wanted to pursue this strategy anyway—say, to benchmark it—you could take advantage of the fact that floating-point values are lexicographically ordered in terms of their IEEE 754 representations, except for the sign bit. So, since you are assuming that both values are positive:
FindMinimumOfTwoPositiveDoubles(double a, double b):
mov rax, QWORD PTR [a]
mov rdx, QWORD PTR [b]
sub rax, rdx ; subtract bitwise representation of the two values
shr rax, 63 ; isolate the sign bit to see if the result was negative
ret
FindMaximumOfTwoPositiveDoubles(double a, double b):
mov rax, QWORD PTR [b] ; \ reverse order of parameters
mov rdx, QWORD PTR [a] ; / for the SUB operation
sub rax, rdx
shr rax, 63
ret
Or, to avoid inline assembly:
bool FindMinimumOfTwoPositiveDoubles(double a, double b)
{
static_assert(sizeof(a) == sizeof(uint64_t),
"A double must be the same size as a uint64_t for this bit manipulation to work.");
const uint64_t aBits = *(reinterpret_cast<uint64_t*>(&a));
const uint64_t bBits = *(reinterpret_cast<uint64_t*>(&b));
return ((aBits - bBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}
bool FindMaximumOfTwoPositiveDoubles(double a, double b)
{
static_assert(sizeof(a) == sizeof(uint64_t),
"A double must be the same size as a uint64_t for this bit manipulation to work.");
const uint64_t aBits = *(reinterpret_cast<uint64_t*>(&a));
const uint64_t bBits = *(reinterpret_cast<uint64_t*>(&b));
return ((bBits - aBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}
Note that there are severe caveats to this implementation. In particular, it will break if the two floating-point values have different signs, or if both values are negative. If both values are negative, then the code can be modified to flip their signs, do the comparison, and then return the opposite value. To handle the case where the two values have different signs, code can be added to check the sign bit.
// ...
// Enforce two's-complement lexicographic ordering.
if (aBits < 0)
{
aBits = ((1 << ((sizeof(uint64_t) * CHAR_BIT) - 1)) - aBits);
}
if (bBits < 0)
{
bBits = ((1 << ((sizeof(uint64_t) * CHAR_BIT) - 1)) - bBits);
}
// ...
Dealing with negative zero will also be a problem. IEEE 754 says that +0.0 is equal to −0.0, so your comparison function will have to decide if it wants to treat these values as different, or add special code to the comparison routines that ensures negative and positive zero are treated as equivalent.
Adding all of this special-case code will certainly reduce performance to the point that we will break even with a naïve floating-point comparison, and will very likely end up being slower.

Saturate short (int16) in C++

I am optimizing bottleneck code:
int sum = ........
sum = (sum >> _bitShift);
if (sum > 32000)
sum = 32000; //if we get an overflow, saturate output
else if (sum < -32000)
sum = -32000; //if we get an underflow, saturate output
short result = static_cast<short>(sum);
I would like to write the saturation condition as one "if condition" or even better with no "if condition" to make this code faster. I don't need saturation exactly at value 32000, any similar value like 32768 is acceptable.
According this page, there is a saturation instruction in ARM. Is there anything similar in x86/x64?
I'm not at all convinced that attempting to eliminate the if statement(s) is likely to do any real good. A quick check indicates that given this code:
int clamp(int x) {
if (x < -32768)
x = -32768;
else if (x > 32767)
x = 32767;
return x;
}
...both gcc and Clang produce branch-free results like this:
clamp(int):
cmp edi, 32767
mov eax, 32767
cmovg edi, eax
mov eax, -32768
cmp edi, -32768
cmovge eax, edi
ret
You can do something like x = std::min(std::max(x, -32768), 32767);, but this produces the same sequence, and the source seems less readable, at least to me.
You can do considerably better than this if you use Intel's vector instructions, but probably only if you're willing to put a fair amount of work into it--in particular, you'll probably need to operate on an entire (small) vector of values simultaneously to accomplish much this way. If you do go that way, you usually want to take a somewhat different approach to the task than you seem to be taking right now. Right now, you're apparently depending on int being a 32-bit type, so you're doing the arithmetic on a 32-bit type, then afterwards truncating it back down to a (saturated) 16-bit value.
With something like AVX, you'd typically want to use an instruction like _mm256_adds_epi16 to take a vector of 16 values (of 16-bits apiece), and do a saturating addition on all of them at once (or, likewise, _mm256_subs_epi16 to do saturating subtraction).
Since you're writing C++, what I've given above are the names of the compiler intrinsics used in most current compilers (gcc, icc, clang, msvc) for x86 processors. If you're writing assembly language directly, the instructions would be vpaddsw and vpsubsw respectively.
If you can count on a really current processor (one that supports AVX 512 instructions) you can use them instead to operate on a vector of 32 16-bit values simultaneously.
Are you sure you can beat the compiler at this?
Here's x64 retail with max size optimizations enabled. Visual Studio v15.7.5.
ecx contains the intial value at the start of this block. eax is filled with the saturated value when it is done.
return (x > 32767) ? 32767 : ((x < -32768) ? -32768 : x);
mov edx,0FFFF8000h
movzx eax,cx
cmp ecx,edx
cmovl eax,edx
mov edx,7FFFh
cmp ecx,edx
movzx eax,ax
cmovg eax,edx

why is it faster to print number in binary using arithmetic instead of _bittest

The purpose of the next two code section is to print number in binary.
The first one does this by two instructions (_bittest), while the second does it by pure arithmetic instructions which is three instructions.
the first code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
unsigned char bits[32];
long nBit;
LARGE_INTEGER a, b, f;
QueryPerformanceCounter(&a);
for (size_t i = 0; i < 100000000; i++)
{
for (nBit = 0; nBit < 31; nBit++)
{
bits[nBit] = _bittest(&num, nBit);
}
}
QueryPerformanceCounter(&b);
QueryPerformanceFrequency(&f);
printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
printf_s("Binary representation:\n");
while (nBit--)
{
if (bits[nBit])
printf_s("1");
else
printf_s("0");
}
return 0;
}
the inner loop is compile to the instructions bt and setb
The second code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
unsigned char bits[32];
long nBit;
LARGE_INTEGER a, b, f;
QueryPerformanceCounter(&a);
for (size_t i = 0; i < 100000000; i++)
{
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++)
{
bits[nBit] = (num&curBit) >> nBit;
curBit <<= 1;
}
}
QueryPerformanceCounter(&b);
QueryPerformanceFrequency(&f);
printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
printf_s("Binary representation:\n");
while (nBit--)
{
if (bits[nBit])
printf_s("1");
else
printf_s("0");
}
return 0;
}
The inner loop compile to and add(as shift left) and sar.
the second code section run three time faster then the first one.
Why three cpu instructions run faster then two?
Not answer (Bo did), but the second inner loop version can be simplified a bit:
long numCopy = num;
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = numCopy & 1;
numCopy >>= 1;
}
Has subtle difference (1 instruction less) with gcc 7.2 targetting 32b.
(I'm assuming 32b target, as you convert long into 32 bit array, which makes sense only on 32b target ... and I assume x86, as it includes <windows.h>, so it's clearly for obsolete OS target, although I think windows now have even 64b version? (I don't care.))
Answer:
Why three cpu instructions run faster then two?
Because the count of instructions only correlates with performance (usually fewer is better), but the modern x86 CPU is much more complex machine, translating the actual x86 instructions into micro-code before execution, transforming that further by things like out-of-order-execution and register renaming (to break false dependency chains), and then it executes the resulting microcode, with different units of CPU capable to execute only some micro-ops, so in ideal case you may get 2-3 micro-ops executed in parallel by the 2-3 units in single cycle, and in worst case you may be executing an complete micro-code loop implementing some complex x86 instruction taking several cycles to finish, blocking most of the CPU units.
Another factor is availability of data from memory and memory writes, a single cache miss, when the data must be fetched from higher level cache, or even memory itself, creates tens-to-hundreds cycles stall. Having compact data structures favouring predictable access patterns and not exhausting all cache-lines is paramount for exploiting maximum CPU performance.
If you are at stage "why 3 instructions are faster than 2 instructions", you pretty much can start with any x86 optimization article/book, and keep reading for few months or year(s), it's quite complex topic.
You may want to check this answer https://gamedev.stackexchange.com/q/27196 for further reading...
I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code).
_bittest is slower because MSVC does a horrible job and keeps the value in memory and bt [mem], reg is much slower than bt reg,reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still constant!
I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say and yes it matters: bt [mem], reg has one per 3 cycle throughput on Ryzen, one per 5 cycle throughput on Haswell. And other perf characteristics differ...
(For just looking at the asm, it's usually a good idea to make a function with args to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know if anything modifies num before main runs, because it's not static.)
Your instruction-counting didn't include the whole loop so your counts are wrong, but more importantly you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)
This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell / Skylake:
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = _bittest(&num, nBit);
//bits[nBit] = (bool)(num & (1UL << nBit)); // much more efficient
}
Asm output from MSVC CL19 -Ox on the Godbolt compiler explorer
$LL7#main:
bt DWORD PTR num, ebx ; 10 uops (microcoded), one per 5 cycle throughput
lea rcx, QWORD PTR [rcx+1] ; 1 uop
setb al ; 1 uop
inc ebx ; 1 uop
mov BYTE PTR [rcx-1], al ; 1 uop (micro-fused store-address and store-data)
cmp ebx, 31
jb SHORT $LL7#main ; 1 uop (macro-fused with cmp)
That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.
IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as the bt, or the data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?
Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. This loop would have been faster than your other loop with a 1 uop, 1c latency, 2-per-clock throughput bt r9d, ebx.
The inner loop compile to and add(as shift left) and sar.
Huh? Those are the instructions MSVC associates with the curBit <<= 1; source line (even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line.)
But the whole loop is this clunky mess:
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = (num&curBit) >> nBit;
curBit <<= 1;
}
$LL18#main: # MSVC CL19 -Ox
mov ecx, ebx ; 1 uop
lea r8, QWORD PTR [r8+1] ; 1 uop pointer-increment for bits
mov eax, r9d ; 1 uop. r9d holds num
inc ebx ; 1 uop
and eax, edx ; 1 uop
# MSVC says all the rest of these instructions are from curBit <<= 1; but they're obviously not.
add edx, edx ; 1 uop
sar eax, cl ; 3 uops (variable-count shifts suck)
mov BYTE PTR [r8-1], al ; 1 uop (micro-fused)
cmp ebx, 31
jb SHORT $LL18#main ; 1 uop (macro-fused with cmp)
So this is 11 fused-domain uops, and takes 2.75 clock cycles per iteration to issue from the front-end.
I don't see any loop-carried dep chains longer than that front-end bottleneck, so it probably runs about that fast.
Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift-count is needed in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that.)
There are major missed optimizations here (in the "fast" version), so you should probably write your source differently do hand-hold your compiler into making less bad code. It implements this fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg,reg / setc
How to do this fast in asm or with intrinsics
Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit for that element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 (add 0 or 1) to ASCII '0', conditionally turning it into '1'.
The first steps of this (getting a vector of 0/-1 from a bitmask) is How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
Fastest way to unpack 32 bits to a 32 byte SIMD vector (has a 128b version). Without SSSE3 (pshufb), I think punpcklbw / punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.
is there an inverse instruction to the movemask instruction in intel avx2?.
In scalar code, this is one way that runs at 1 bit->byte per clock. There are probably ways to do better without using SSE2 (storing multiple bytes at once to get around the 1 store per clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.
mov eax, [num]
lea rdi, [rsp + xxx] ; bits[]
.loop:
shr eax, 1 ; constant-count shift is efficient (1 uop). CF = last bit shifted out
setc [rdi] ; 2 uops, but just as efficient as setc reg / mov [mem], reg
shr eax, 1
setc [rdi+1]
add rdi, 2
cmp end_pointer ; compare against another register instead of a separate counter.
jb .loop
Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.
The difference is that the code _bittest(&num, nBit); uses a pointer to num, which makes the compiler store it in memory. And the memory access makes the code a lot slower.
bits[nBit] = _bittest(&num, nBit);
00007FF6D25110A0 bt dword ptr [num (07FF6D2513034h)],ebx ; <-----
00007FF6D25110A7 lea rcx,[rcx+1]
00007FF6D25110AB setb al
00007FF6D25110AE inc ebx
00007FF6D25110B0 mov byte ptr [rcx-1],al
The other version stores all the variables in registers, and uses very fast register shifts and adds. No memory accesses.

Compile error with embedded assembler

I don't understand why this code
#include <iostream>
using namespace std;
int main(){
int result=0;
_asm{
mov eax,3;
MUL eax,3;
mov result,eax;
}
cout<<result<<endl;
return 0;
}
shows the following error.
1>c:\users\david\documents\visual studio 2010\projects\assembler_instructions\assembler_instructions.cpp(11): error C2414: illegal number of operands
Everything seems fine, and yet why do I get this compiler error?
According to this page, the mul instruction only takes a single argument:
mul arg
This multiplies "arg" by the value of corresponding byte-length in the A register, see table below:
operand size 1 byte 2 bytes 4 bytes
other operand AL AX EAX
higher part of result stored in: AH DX EDX
lower part of result stored in: AL AX EAX
Thus following the notes as per Justin's link:
#include <iostream>
int main()
{
int result=0;
_asm{
mov eax, 3;
mov ebx, 4;
mul ebx;
mov result,eax;
}
std::cout << result << std::endl;
return 0;
}
Use:
imul eax, 3;
or:
imul eax, eax, 3;
That way you don't need to worry about edx -register being clobbered. It's "signed integer multiply". You seem to have 'int' -result so it shouldn't matter whether you use mul or imul.
Sometimes I've gotten errors from not having edx register zeroed when dividing or multiplying. CPU was Intel core2 quad Q9550
There's numbingly overengineered but correct intel instruction reference manuals you can read. Though intel broke its websites while ago. You could try find same reference manuals from AMD sites though.
Update: I found the manual: http://www.intel.com/design/pentiumii/manuals/243191.htm
I don't know when they are going to again break their sites, so you really always need to search it up.
Update2: ARGHL! those are from year 1999.. well most details are unfortunately the same.
You should download the Intel architecture manuals.
http://www.intel.com/products/processor/manuals/
For your purpose, volume 2 is going to help you the most.
As of access in July 2010, they are current.