Inline asm and c array questions - c++

This is a homework. I have 3 arrays, v1={5,4,3,2,1} ,v2={1,2,3,4,5} and v3={2,3,5,1,4}, the assigment is to change the 1 to 6. Of course, any solution like v1[4]=6, in asm or c is forbidden. So this was my code:
First Code
void main(){
int myArray[5]={5,4,3,2,1};
__asm {
mov ecx,0 //using ecx as counter
myLoop:
mov eax, myArray[ecx] //moving the content on myArray in position ecx to eax
cmp eax,1 //comparing eax to 1
je is_one //if its equal jump to label is_one
inc ecx //ecx+1
cmp ecx,5 //since all vectors have size 5, comparing if ecx is equal to 5
jne myLoop //if not, repeat
jmp Done //if true, go to label Done
is_one:
mov myArray[ecx],6 //changing the content in myArray position ecx to 6
inc ecx //ecx+1
cmp ecx,5 // ecx=5?
jne myLoop //no? repeat loop
jmp Done //yes? Done
Done:
}
printArray(myArray);
}
this didn't work, tried many things like mov eax,6 or mov [eax+ecx],6 , nothing worked until I found this solution
Many tries later code
void main(){
int myArray[5]={5,4,3,2,1};
__asm {
mov ecx,0 //using ecx as counter
myLoop:
mov eax, myArray[TYPE myArray*ecx] //I don't understand how this works
cmp eax,1 //comparing eax to 1
je is_one //if its equal jump to label is_one
inc ecx //ecx+1
cmp ecx,5 //since all vectors have size 5, comparing if ecx is equal to 5
jne myLoop //if not, repeat
jmp Done //if true, go to label Done
is_one:
mov myArray[TYPE myArray*ecx],6 //Uhh...
inc ecx //ecx+1
cmp ecx,5 // ecx=5?
jne myLoop //no? repeat loop
jmp Done //yes? Done
Done:
}
printArray(myArray);
}
And that works like a charm. But I don't understand how or why the MOV array[TYPE array * index], value works(besides TYPE returning the size as explained in link), and why not the others.
Also, since I have to do this for 3 arrays, I tried to copy and paste all the code to changingArray(int myArray[]), declared the 3 arrays in the main, and passed them to changingArray, but now is not changing them. Im pretty sure that with vector you dont have to pass with &, I could be wrong. Still, I can't see why it doesn't change them. So...
Final Code
void changingArray(int myArray[]){
__asm {
mov ecx,0 //using ecx as counter
myLoop:
mov eax, myArray[TYPE myArray*ecx] //I don't understand how this works
cmp eax,1 //comparing eax to 1
je is_one //if its equal jump to label is_one
inc ecx //ecx+1
cmp ecx,5 //since all vectors have size 5, comparing if ecx is equal to 5
jne myLoop //if not, repeat
jmp Done //if true, go to label Done
is_one:
mov myArray[TYPE myArray*ecx],6 //Uhh...
inc ecx //ecx+1
cmp ecx,5 // ecx=5?
jne myLoop //no? repeat loop
jmp Done //yes? Done
Done:
}
printArray(myArray);
}
void main(){
//for some odd reason, they arent changing
int v1[5]={5,4,3,2,1};
int v2[5]={1,2,3,4,5};
int v3[5]={2,3,5,1,4};
changingArray(v1);
changingArray(v2);
changingArray(v3);
}
TL:DR section:
Homework of changing the number 1 to 6 in 3 arrays v1={5,4,3,2,1} ,v2={1,2,3,4,5} and v3={2,3,5,1,4}
1-I don't get why the first code doesn't work, but many tries later code works (the MOV array[TYPE array * index], value instruction).
2- Since I need to do this with 3 arrays, I put all the code in changingArray(int myArray[]), and in the main I declared my 3 arrays in main as shown in final code. While many tries code did change the array, this doesnt. Probably I just made a mistake in c and not asm, but I don't see it.
And sorry for bad english, is not my first language.

mov eax, myArray[TYPE myArray*ecx]
Here the address referred to is (base address of myArray) + sizeof(the type of elements of myArray) * ecx. In assembly language the indexing should be done in bytes.

Related

Assembly: loop through a sequence of characters and swap them

My assignment is to Implement a function in assembly that would do the following:
loop through a sequence of characters and swap them such that the end result is the original string in reverse ( 100 points )
Hint: collect the string from user as a C-string then pass it to the assembly function along with the number of characters entered by the user. To find out the number of characters use strlen() function.
i have written both c++ and assembly programs and it works fine for extent: for example if i input 12345 the out put is correctly shown as 54321 , but if go more than 5 characters : the out put starts to be incorrect: for example if i input 123456 the output is :653241. i will greatly appreciate anyone who can point where my mistake is:
.code
_reverse PROC
push ebp
mov ebp,esp ;stack pointer to ebp
mov ebx,[ebp+8] ; address of first array element
mov ecx,[ebp+12] ; the number of elemets in array
mov eax,ebx
mov ebp,0 ;move 0 to base pointer
mov edx,0 ; set data register to 0
mov edi,0
Setup:
mov esi , ecx
shr ecx,1
add ecx,edx
dec esi
reverse:
cmp ebp , ecx
je allDone
mov edx, eax
add eax , edi
add edx , esi
Swap:
mov bl, [edx]
mov bh, [eax]
mov [edx],bh
mov [eax],bl
inc edi
dec esi
cmp edi, esi
je allDone
inc ebp
jmp reverse
allDone:
pop ebp ; pop ebp out of stack
ret ; retunr the value of eax
_reverse ENDP
END
and here is my c++ code:
#include<iostream>
#include <string>
using namespace std;
extern"C"
char reverse(char*, int);
int main()
{
char str[64] = {NULL};
int lenght;
cout << " Please Enter the text you want to reverse:";
cin >> str;
lenght = strlen(str);
reverse(str, lenght);
cout << " the reversed of the input is: " << str << endl;
}
You didn't comment your code, so IDK what exactly you're trying to do, but it looks like you are manually doing the array indexing with MOV / ADD instead of using an addressing mode like [eax + edi].
However, it looks like you're modifying your original value and then using it in a way that would make sense if it was unmodified.
mov edx, eax ; EAX holds a pointer to the start of array, read every iter
add eax , edi ; modify the start of the array!!!
add edx , esi
Swap:
inc edi
dec esi
EAX grows by EDI every step, and EDI increases linearly. So EAX increases geometrically (integral(x * dx) = x^2).
Single-stepping this in a debugger should have found this easily.
BTW, the normal way to do this is to walk one pointer up, one pointer down, and fall out of the loop when they cross. Then you don't need a separate counter, just cmp / ja. (Don't check for JNE or JE, because they can cross each other without ever being equal.)
Overall you the right idea to start at both ends of the string and swap elements until you get to the middle. Implementation is horrible though.
mov ebp,0 ;move 0 to base pointer
This seems to be loop counter (comment is useless or even worse); I guess idea was to swap length/2 elements which is perfectly fine. HINT I'd just compare pointers/indexes and exit once they collide.
mov edx,0 ; set data register to 0
...
add ecx,edx
mov edx, eax
Useless and misleading.
mov edi,0
mov esi , ecx
dec esi
Looks like indexes to start/end of the string. OK. HINT I'd go with pointers to start/end of the string; but indexes work too
cmp ebp , ecx
je allDone
Exit if did length/2 iterations. OK.
mov edx, eax
add eax , edi
add edx , esi
eax and edx point to current symbols to be swapped. Almost OK but this clobbers eax! Each loop iteration after second will use wrong pointers! This is what caused your problem in the first place. This wouldn't have happened if you used pointers instead indexes, or if you'd used offset addressing [eax+edi]/[eax+esi]
...
Swap part is OK
cmp edi, esi
je allDone
Second exit condition, this time comparing for index collision! Generally one exit condition should be enough; several exit conditions usually either superfluous or hint at some flaw in the algorithm. Also equality comparison is not enough - indexes can go from edi<esi to edi>esi during single iteration.

Reverse external array in assembly using swapping method - x86 MASM

I am working on a project where we need to pass an array of type char as a parameter and reverse the array. I feel like I am very close to getting it done, but I am stuck on the actual swapping process.
For my swapping function in my .asm, I used the same method I would in c++ (use an unused register as a temp, then swap the front and the back.) What I am not understanding is how would I go about changing the actual content at that address. I assumed performing the following would "change" the content at the destination address:
mov eax,[edx]
However, this did not work as planned. After I ran a for loop to iterate through the array again, everything stayed the same.
If anyone can point me in the right direction, it would be great. I have provided the code below with as much comments as I could provide.
Also, I am doing all this in a single .asm file; however, my professor wants me to have 3 separate .asm document for each of the following functions: swap, reverse, and getLength. I tried to include the other 2 .asm document in the reverse.asm, but it kept giving me an error.
Assembly Code Starts:
.686
.model flat
.code
_reverse PROC
push ebp
mov ebp,esp ;Have ebp point to esp
mov ebx,[ebp+8] ;Point to beginning of array
mov eax,ebx
mov edx,1
mov ecx,0
mov edi,0
jmp getLength
getLength:
cmp ebp, 0 ;Counter to iterate until needed to stop
je setup
add ecx,1
mov ebp,[ebx+edx]
add edx,1
jmp getLength
setup: ;This is to set up the numbers correctly and get array length divided by 2
mov esi,ecx
mov edx,0
mov eax,ecx
mov ecx,2
div ecx
mov ecx,eax
add ecx,edx ;Set up ecx(Length of string) correctly by adding modulo if odd length string
mov eax,ebx
dec esi
jmp reverse
reverse: ;I started the reverse function by using a counter to iterate through length / 2
cmp edi, ecx
je allDone
mov ebx,eax ;Set ebx to the beginning of array
mov edx,eax ;Set edx to the beginning of array
add ebx,edi ;Move ebx to correct index to perform swap
add edx,esi ;Move edx to the back at the correct index
jmp swap ;Invoke swap function
swap:
mov ebp,ebx ;Move value to temp
mov ebx,[edx] ;Swap the back end value to the front
mov edx,[edx] ;Move temp to back
inc edi ;Increment to move up one index to set up next swap
dec esi ;Decrement to move back one index to set up for next swap
jmp reverse ;Jump back to reverse to setup next index swapping
allDone:
pop ebp
ret
_reverse ENDP
END
C++ Code starts:
#include <iostream>
#include <string>
using namespace std;
extern "C" char reverse(char*);
int main()
{
const int SIZE = 20;
char str1[SIZE] = { NULL };
cout << "Please enter a string: ";
cin >> str1;
cout << "Your string is: ";
for (int i = 0; str1[i] != NULL; i++)
{
cout << str1[i];
}
cout << "." << endl;
reverse(str1);
cout << "Your string in reverse is: ";
for (int i = 0; str1[i] != NULL; i++)
{
cout << str1[i];
}
cout << "." << endl;
system("PAUSE");
return 0;
}
So after many more hours of tinkering and looking around, I was finally able to figure out how to properly copy over a byte. I will post my .asm code below with comments if anybody needs it for future reference.
I was actually moving the content of the current address into a 32 bit registers. After I changed it from mov ebx,[eax] to mov bl,[eax], it copied the value correctly.
I will only post the code that I was having difficulty with so I do not give away the entire project for other students.
ASM Code Below:
swap:
mov bl,[edx] ;Uses bl since we are trying to copy a 1 byte char value
mov bh,[eax] ;Uses bh since we are trying to copy a 1 byte char value
mov [edx],bh ;Passing the value to the end of the array
mov [eax],bl ;Passing the value to the beginning of the array
inc eax ;Moving the array one index forward
dec edx ;Moving the array one index backwards
dec ecx ;Decreasing the counter by one to continue loop as needed
jmp reverse ;Jump back to reverse to check if additional swap is needed
Thanks for everyone that helped.
mov eax,[edx] (assuming intel syntax) places the 32 bits found in memory at address edx into eax. I.e, this code retrieves data from a memory location. If you'd like to write to a mem location, you need to reverse this, i.e mov [edx], eax
After playing with some 16 bit code overnight for sorting, I've the following two functions that may be of use. Obviously, you can't copy/paste them - you'll have to study it. However, you'll notice that it is able to swap items of arbitrary size. Perfect for swapping elements that are structures of some type.
; copies cx bytes from ds:si to es:di
copyBytes:
shr cx, 1
jnc .swapCopy1Loop
movsb
.swapCopy1Loop:
shr cx, 1
jnc .swapCopy2Loop
movsw
.swapCopy2Loop:
rep movsd
ret
; bp+0 bp+2 bp+4
;void swap(void *ptr1, void *ptr2, int dataSizeBytes)
swapElems:
push bp
mov bp, sp
add bp, 4
push di
push si
push es
mov ax, ds
mov es, ax
sub sp, [bp+4] ; allocate dataSizeBytes on the stack, starting at bp-6 - dataSizeBytes
mov di, sp
mov si, [bp+0]
mov cx, [bp+4]
call copyBytes
mov si, [bp+2]
mov di, [bp+0]
mov cx, [bp+4]
call copyBytes
mov si, sp
mov di, [bp+2]
mov cx, [bp+4]
call copyBytes
add sp, [bp+4]
pop es
pop si
pop di
pop bp
ret 2 * 3

Creating loop in x86 assembly and use of arrays? [duplicate]

I am currently learning assembly programming as part of one of my university modules. I have a program written in C++ with inline x86 assembly which takes a string of 6 characters and encrypts them based on the encryption key.
Here's the full program: https://gist.github.com/anonymous/1bb0c3be77566d9b791d
My code fo the encrypt_chars function:
void encrypt_chars (int length, char EKey)
{ char temp_char; // char temporary store
for (int i = 0; i < length; i++) // encrypt characters one at a time
{
temp_char = OChars [i]; // temp_char now contains the address values of the individual character
__asm
{
push eax // Save values contained within register to stack
push ecx
movzx ecx, temp_char
push ecx // Push argument #2
lea eax, EKey
push eax // Push argument #1
call encrypt
add esp, 8 // Clean parameters of stack
mov temp_char, al // Move the temp character into a register
pop ecx
pop eax
}
EChars [i] = temp_char; // Store encrypted char in the encrypted chars array
}
return;
// Inputs: register EAX = 32-bit address of Ekey,
// ECX = the character to be encrypted (in the low 8-bit field, CL).
// Output: register EAX = the encrypted value of the source character (in the low 8-bit field, AL).
__asm
{
encrypt:
push ebp // Set stack
mov ebp, esp // Set up the base pointer
mov eax, [ebp + 8] // Move value of parameter 1 into EAX
mov ecx, [ebp + 12] // Move value of parameter 2 into ECX
push edi // Used for string and memory array copying
push ecx // Loop counter for pushing character onto stack
not byte ptr[eax] // Negation
add byte ptr[eax], 0x04 // Adds hex 4 to EKey
movzx edi, byte ptr[eax] // Moves value of EKey into EDI using zeroes
pop eax // Pop the character value from stack
xor eax, edi // XOR character to give encrypted value of source
pop edi // Pop original address of EDI from the stack
rol al, 1 // Rotates the encrypted value of source by 1 bit (left)
rol al, 1 // Rotates the encrypted value of source by 1 bit (left) again
add al, 0x04 // Adds hex 4 to encrypted value of source
mov esp, ebp // Deallocate values
pop ebp // Restore the base pointer
ret
}
//--- End of Assembly code
}
My questions are:
What is the best/ most efficient way to convert this for loop into assembly?
Is there a way to remove the call for encrypt and place the code directly in its place?
How can I optimise/minimise the use of registers and instructions to make the code smaller and potentially faster?
Is there a way for me to convert the OChars and EChars arrays into assembly?
If possible, would you be able to provide me with an explanation of how the solution works as I am eager to learn.
I can't help with optimization or the cryptography but i can show you a way to go about making a loop, if you look at the loop in this function:
void f()
{
int a, b ;
for(a = 10, b = 1; a != 0; --a)
{
b = b << 2 ;
}
}
The loop is essentially:
for(/*initialize*/; /*condition*/; /*modify*/)
{
// run code
}
So the function in assembly would look something along these lines:
_f:
push ebp
mov ebp, esp
sub esp, 8 ; int a,b
initialize: ; for
mov dword ptr [ebp-4], 10 ; a = 10,
mov dword ptr [ebp-8], 1 ; b = 1
mov eax, [ebp-4]
condition:
test eax, eax ; tests if a == 0
je exit
runCode:
mov eax, [ebp-8]
shl eax, 2 ; b = b << 2
mov dword ptr [ebp-8], eax
modify:
mov eax, [ebp-4]
sub eax, 1 ; --a
mov dword ptr [ebp-4], eax
jmp condition
exit:
mov esp, ebp
pop ebp
ret
Plus I show in the source how you make local variables;
subtract the space from the stack pointer.
and access them through the base pointer.
I tried to make the source as generic intel x86 assembly syntax as i could so my apologies if anything needs changing for your specific environment i was more aiming to give a general idea about how to construct a loop in assembly then giving you something you can copy, paste and run.
I would suggest to look into assembly code which is generated by compiler. You can change and optimize it later.
How do you get assembler output from C/C++ source in gcc?

Inline Assembly Code Does Not Compile in Visual C++ 2010 Express

I have some assembly merge sort code that I obtained from Github and I am trying to embed it into Inline Assembly in C++, but it won't compile and keeps returning these errors:
1>c:\users\mayank\desktop\assembly\assembly\main.cpp(147): error
C2415: improper operand type
The code that I am attempting to run is this:
#include <iostream>
#include <cmath>
#include <stdio.h>
using namespace std;
const int ARRAYSIZE = 30;
int main()
{
int arr[ARRAYSIZE];
int temp_arr[ARRAYSIZE];
int number;
for(int x = 0; x < ARRAYSIZE; x++)
{
number = (rand() % 99) + 1;
arr[x] = number;
}
/*
READ_ARR_LEN:
__asm
{
// Read the length of the array
//GetLInt [30] // Size of input array
//PutLInt [30]
}
GET_ARRAY:
__asm
{
//intel_syntax
// Get values in arr from the user
mov EAX, arr
mov ECX, ARR_LEN
call Read_Arr
// Run Merge Sort on the array
mov EAX, arr
mov EBX, temp_arr
mov ECX, ARR_LEN
call Merge_Sort
// EXIT
};;
*/
Merge_Sort:
__asm
{
// EAX - Array start
// ECX - array length
// Arrays of size 0 or 1 are already sorted
cmp ARRAYSIZE, 2
jl Trivial_Merge_Sort
// Merge_Sort (first half)
// Length of the first half
// ECX /= 2
push ARRAYSIZE
shr ARRAYSIZE, 1
call Merge_Sort
pop ARRAYSIZE
// Merge_Sort (second half)
push arr
push EBX
push ARRAYSIZE
// Length of the second half
// ECX = ECX - ECX/2
mov EDX, ARRAYSIZE
shr EDX, 1
sub ARRAYSIZE, EDX
imul EDX, 4
// Start index of the second half
// EAX = EAX + (ECX/2) * 4
add arr, EDX
push EDX
call Merge_Sort
pop EDX
pop ARRAYSIZE
pop EBX
pop arr
pushad
// Merge (first half, second half)
// Length of first half = ECX/2
// Length of second half = ECX - ECX/2
mov EDX, ECX
shr ECX, 1
sub EDX, ECX
// Start of second half = EAX + (ECX/2) * 4
mov EBX, EAX
mov EDI, ECX
imul EDI, 4
add EBX, EDI
// Index of temp array = 0
sub EDI, EDI
call Merge
popad
// Copy back the merged array from temp_arr to arr
call Merge_Copy_Back_Temp
ret
};
Trivial_Merge_Sort:
__asm
{
// In case of arrays of length 0 or 1
ret
};
Merge:
__asm
{
// Merge two arrays contents.
// The final merged array will be in temp_arr
// Merging is done recursively.
// Arguments:
// EAX - First array's start
// EBX - Second array's start
// ECX - Length of first array
// EDX - Length of second array
// EDI - Index in temp array
pushad
// Handle the cases where one array is empty
cmp ARRAYSIZE, 0
jz First_Array_Over
cmp EDX, 0
jz Second_Array_Over
// Compare first elements of both the arrays
push ARRAYSIZE
push EDI
mov ARRAYSIZE, [arr]
mov EDI, [ARRAYSIZE]
cmp ARRAYSIZE, EDI
pop EDI
pop ARRAYSIZE
// Pick which ever is the least and update that array
jl Update_First_Array
jmp Update_Second_Array
};
Update_First_Array:
__asm
{
// min_elem = min (first elements of first array and second array)
// Put min_elem into the temp array
push dword ptr [EAX]
pop dword ptr [temp_arr + EDI * 4]
add EAX, 4
dec ECX
inc EDI
// Recursively call Merge on the updated array and the
// other array
call Merge
popad
ret
};
Update_Second_Array:
__asm
{
// min_elem = min (first elements of first array and second array)
// Put min_elem into the temp array
push dword ptr [EBX]
pop dword ptr [temp_arr + EDI * 4]
add EBX, 4
dec EDX
inc EDI
// Recursively call Merge on the updated array and the
// other array
call Merge
popad
ret
};
Merge_Copy_Back_Temp:
__asm
{
// Copy back the temp array into original array
// Arguments:
// EAX - original array address
// ECX - original array length
pushad
// For copying back, the destination array is EAX
mov EBX, EAX
// Now, the source array is temp_arr
mov EAX, temp_arr
call Copy_Array
popad
ret
};
Trivial_Merge:
__asm
{
// Note: One array is empty means no need to merge.
popad
ret
};
First_Array_Over:
__asm
{
// Copy the rest of the second array to the temp arr
// because the first array is empty
pushad
mov EAX, EBX
mov ECX, EDX
mov EBX, temp_arr
imul EDI, 4
add EBX, EDI
call Copy_Array
popad
popad
ret
};
Second_Array_Over:
__asm
{
// Copy the rest of the first array to the temp arr
// because the second array is empty
pushad
mov EBX, temp_arr
imul EDI, 4
add EBX, EDI
call Copy_Array
popad
popad
ret
};
Copy_Array:
__asm
{
// Copy array to destination array
// EAX - Array start
// EBX - Destination array
// ECX - Array length
// Trivial case
cmp ECX, 0
jz Copy_Empty_Array
push ECX
sub EDI, EDI
};
copy_loop:
__asm
{
// Copy each element
push dword ptr [EAX + EDI * 4]
pop dword ptr [EBX + EDI * 4]
inc EDI
loop copy_loop
pop ECX
ret
};
Copy_Empty_Array:
__asm
{
ret
};
Read_Arr:
__asm
{
// EAX - array start
// ECX - array length
mov ESI, EAX
sub EDI, EDI
};
loop1:
__asm
{
// Read each element
lea eax,[esi+edx*4]
inc EDI
loop loop1
ret
};
return 0;
}
(Update: In the original code posted in the question there were attempts to address memory as DWORD [address], which is incompatible with the syntax used by Visual C++'s inline assembler as I point out in my answer below.)
Visual C++ uses MASM syntax for its inline assembly, so you need to use DWORD PTR instead of just DWORD. That's what's causing these compilation errors.
See e.g. this table from The Art of Assembly.
This looks like the code is from this github repository.
In that code, GetLInt is actually a NASM macro which is included in an external macro definition file and calls a function proc_GetLInt, which in turn is provided in an object file io.o, the source code is not there. The problem is therefore simply that
you didn't realize that GetLint is external code that you're missing
even if you took all files from that repository, it would work because NASM macros don't work directly in VC++ inline assembly
even if you fixed the macro problem, you still don't have the GetLInt function because it is provided as a linux object file only, you'd have to write it yourself
How do you fix this?
That code was meant to provide a self-contained assembler program that handles all input/output on its own. Since you're inlining it in VC++, you have much more powerful I/O handling at your hands already. Use those instead, i.e. make sure that the values you want to sort are already in arr before the inline assembly starts.
Then, looking at the code: Merge_Sort expects the start of your array in EAX, and its length in ECX. You can get both from your C++ code. When you do that, you no longer need the READ_ARR_LEN and GET_ARRAY blocks from the assembler code.
I am rather reluctant to reproduce parts of the code with modifications as I cannot find a licence file on github that would say I may do so. Let me try to describe: You need to manually move the pointer to arr into EAX and the content of ARRAYSIZE into EBX at the very start of the assembler routine. (*) As I can see, you have already taken care of filling the array with numbers, so there's nothing you need to do there.
Then you need to remove all unnecessary assembler functions and calls to them. You also should either condense all your separate __asm blocks into one or use external variables to conserve and restore registers between blocks (or read the tutorial here, but just using one block works and is less hassle).
Finally, you have to be careful with the stack frames: every call has to have a matching ret. This is easy to stumble over as the merge sort procedure is recursive.
(*) Careful with VC++'s way of treating variables in inside asm blocks, be sure to actually use pointers when you need them.
So all in all, porting this to VC++ is not a trivial task.

can you suggest me better solutions for this in C++ inline-assembly?

i am learning assembly and i started experiments on SSE and MMX registers within the Digital-Mars C++ compiler (intel sytanx more easily readable). I have finished a program that takes var_1 as a value and converts it to the var_2 number system(this is in 8 bit for now. will expand it to 32 64 128 later) . Program does this by two ways:
__asm inlining
Usual C++ way of %(modulo) operator.
Question: Can you tell me more efficient way to use xmm0-7 and mm0-7 registers and can you tell me how to exchange exact bytes of them with al,ah... 8 bit registers?
Usual %(modulo) operator in the C++ usual way is very slow in comparison with __asm on my computer(pentium-m centrino 2.0GHz).
If you can tell me how to get rid of division instruction in __asmm, it will be even faster.
When i run the program it gives me:
(for the values: var_1=17,var_2=2,all loops are 200M times)
17 is 10001 in number system 2
__asm(clock)...........: 7250 <------too bad. it is 8-bit calc.
C++(clock).............: 12250 <------not very slow(var_2 is a power of 2)
(for the values: var_1=33,var_2=7,all loops are 200M times)
33 is 45 in number system 7
__asm(clock)..........: 2875 <-------not good. it is 8-bit calc.
C++(clock)............: 6328 <----------------really slow(var_2 is not a power of 2)
The second C++ code(the one with % operator): /////////////////////////////////////////////////////////
t1=clock();//reference time
for(int i=0;i<200000000;i++)
{
y=x;
counter=0;
while(y>g)
{
var_3[counter]=y%g;
y/=g;
counter++;
}
var_3[counter]=y%g;
}
t2=clock();//final time
_asm code:////////////////////////////////////////////////////////////////////////////////////////////////////////////
__asm // i love assembly in some parts of C++
{
pushf //here does register backup
push eax
push ebx
push ecx
push edx
push edi
mov eax,0h //this will be outer loop counter init to zero
//init of medium-big registers to zero
movd xmm0,eax //cannot set to immediate constant: xmm0=outer loop counter
shufps xmm0,xmm0,0h //this makes all bits zero
movd xmm1,eax
movd xmm2,eax
shufps xmm1,xmm1,0h
shufps xmm2,xmm2,0h
movd xmm2,eax
shufps xmm3,xmm3,0h//could have made pxor xmm3,xmm3(single instruction)
//init complete(xmm0,xmm1,xmm2,xmm3 are zero)
movd xmm1,[var_1] //storing variable_1 to register
movd xmm2,[var_2] //storing var_2 to register
lea ebx,var_3 //calculate var_3 address
movd xmm3,ebx //storing var_3's address to register
for_loop:
mov eax,0h
//this line is index-init to zero(digit array index)
movd edx,xmm2
mov cl,dl //this is the var_1 stored in cl
movd edx,xmm1
mov al,dl //this is the var_2 stored in al
mov edx,0h
dng:
mov ah,00h //preparation for a 8-bit division
div cl //divide
movd ebx,xmm3 //get var_3 address
add ebx,edx //i couldnt find a way to multiply with 4
add ebx,edx //so i added 4 times ^^
add ebx,edx //add
add ebx,edx //last adding
//below, mov [ebx],ah is the only memory accessing instruction
mov [ebx],ah //(8 bit)this line is equivalent to var_3[i]=remainder
inc edx //i++;
cmp al,00h //is division zero?
jne dng //if no, loop again
//here edi register has the number of digits
movd eax,xmm0 //get the outer loop counter from medium-big register
add eax,01h //j++;
movd xmm0,eax //store the new counter to medium-big register
cmp eax,0BEBC200h //is j<(200,000,000) ?
jb for_loop //if yes, go loop again
mov [var_3_size],edx //now we have number of digits too!
//here does registers revert back to old values
pop edi
pop edx
pop ecx
pop ebx
pop eax
popf
}
Whole code://///////////////////////////////////////////////////////////////////////////////////////
#include <iostream.h>
#include <cmath>
#include<stdlib.h>
#include<stdio.h>
#include<time.h>
int main()
{
srand(time(0));
clock_t t1=clock();
clock_t t2=clock();
int var_1=17; //number itself
int var_2=2; //number system
int var_3[100]; //digits to be showed(maximum 100 as seen )
int var_3_size=0;//asm block will decide what will the number of digits be
for(int i=0;i<100;i++)
{
var_3[i]=0; //here we initialize digits to zeroes
}
t1=clock();//reference time to take
__asm // i love assembly in some parts of C++
{
pushf //here does register backup
push eax
push ebx
push ecx
push edx
push edi
mov eax,0h //this will be outer loop counter init to zero
//init of medium-big registers to zero
movd xmm0,eax //cannot set to immediate constant: xmm0=outer loop counter
shufps xmm0,xmm0,0h //this makes all bits zero
movd xmm1,eax
movd xmm2,eax
shufps xmm1,xmm1,0h
shufps xmm2,xmm2,0h
movd xmm2,eax
shufps xmm3,xmm3,0h
//init complete(xmm0,xmm1,xmm2,xmm3 are zero)
movd xmm1,[var_1] //storing variable_1 to register
movd xmm2,[var_2] //storing var_2 to register
lea ebx,var_3 //calculate var_3 address
movd xmm3,ebx //storing var_3's address to register
for_loop:
mov eax,0h
//this line is index-init to zero(digit array index)
movd edx,xmm2
mov cl,dl //this is the var_1 stored in cl
movd edx,xmm1
mov al,dl //this is the var_2 stored in al
mov edx,0h
dng:
mov ah,00h //preparation for a 8-bit division
div cl //divide
movd ebx,xmm3 //get var_3 address
add ebx,edx //i couldnt find a way to multiply with 4
add ebx,edx //so i added 4 times ^^
add ebx,edx //add
add ebx,edx //last adding
//below, mov [ebx],ah is the only memory accessing instruction
mov [ebx],ah //(8 bit)this line is equivalent to var_3[i]=remainder
inc edx //i++;
cmp al,00h //is division zero?
jne dng //if no, loop again
//here edi register has the number of digits
movd eax,xmm0 //get the outer loop counter from medium-big register
add eax,01h //j++;
movd xmm0,eax //store the new counter to medium-big register
cmp eax,0BEBC200h //is j<(200,000,000) ?
jb for_loop //if yes, go loop again
mov [var_3_size],edx //now we have number of digits too!
//here does registers revert back to old values
pop edi
pop edx
pop ecx
pop ebx
pop eax
popf
}
t2=clock(); //finish time
printf("\n assembly_inline(clocks): %i for the 200 million calculations",(t2-t1));
printf("\n value %i(in decimal) is: ",var_1);
for(int i=var_3_size-1;i>=0;i--)
{
printf("%i",var_3[i]);
}
printf(" in the number system: %i \n",var_2);
//and: more readable form(end easier)
int counter=var_3_size;
int x=var_1;
int g=var_2;
int y=x;// backup
t1=clock();//reference time
for(int i=0;i<200000000;i++)
{
y=x;
counter=0;
while(y>g)
{
var_3[counter]=y%g;
y/=g;
counter++;
}
var_3[counter]=y%g;
}
t2=clock();//final time
printf("\n C++(clocks): %i for the 200 million calculations",(t2-t1));
printf("\n value %i(in decimal) is: ",x);
for(int i=var_3_size-1;i>=0;i--)
{
printf("%i",var_3[i]);
}
printf(" in the number system: %i \n",g);
return 0;
}
edit:
this is 32-bit version
void get_digits_asm()
{
__asm
{
pushf //couldnt store this in other registers
movd xmm0,eax//storing in xmm registers instead of pushing
movd xmm1,ebx//
movd xmm2,ecx//
movd xmm3,edx//
movd xmm4,edi//end of push backups
mov eax,[variable_x]
mov ebx,[number_system]
mov ecx,0h
mov edi,0h
begin_loop:
mov edx,0h
div ebx
lea edi,digits
mov [edi+ecx*4],edx
add ecx,01h
cmp eax,ebx
ja begin_loop
mov edx,0
div ebx
lea edi,digits
mov [edi+ecx*4],edx
inc ecx
mov [digits_total],ecx
movd edi,xmm4//pop edi
movd edx,xmm3//pop edx
movd ecx,xmm2//pop ecx
movd ebx,xmm1//pop ebx
movd eax,xmm0//pop eax
popf
}
}
The code can be much simpler of course: (modeled after the C++ version, does not include pushes and pops, and not tested)
mov esi,200000000
_bigloop:
mov eax,[y]
mov ebx,[g]
lea edi,var_3
; eax = y
; ebx = g
; edi = var_3
xor ecx,ecx
; ecx = counter
_loop:
xor edx,edx
div ebx
mov [edi+ecx*4],edx
add ecx,1
test eax,eax
jnz _loop
sub esi,1
jnz _bigloop
But I would be surprised if it was faster than the C++ version, and in fact it'll almost certainly be slower if the base is a power of two - all sane compilers know how to turn a division and/or modulo by a power of two into bitshifts and bitwise ands.
Here's a version that uses ab 8-bit division. Similar caveats apply, but now the division could even overflow (if y / g is more than 255).
mov esi,200000000
_bigloop:
mov eax,[y]
mov ebx,[g]
lea edi,var_3
; eax = y
; ebx = g
; edi = var_3
xor ecx,ecx
; ecx = counter
_loop:
div bl
mov [edi+ecx],ah
add ecx,1
and eax,0xFF
jnz _loop
sub esi,1
jnz _bigloop