How to use AND to check condition in Assembly language? [duplicate]

I'm trying to check whether a number is even or odd in assembly, using 32-bit NASM. The code works fine for odd numbers, but for even numbers it outputs
Even
Odd
Odd
My code is:
section .data
even db "Even", 0xa;
odd db "Odd", 0xa;
lene equ $-even;
leno equ $-odd;
section .text
global _start;
_start:
mov ax, 0x4;
and ax, 1;
jz evenn;
jnz oddd;
jmp outprog;
evenn:
mov eax, 4;
mov ebx, 1;
mov ecx, even;
mov edx, lene;
int 0x80;
oddd:
mov eax, 4;
mov ebx, 1;
mov ecx, odd;
mov edx, leno;
int 0x80;
outprog:
mov eax, 1;
int 0x80;

void fun ( int x )
{
int y;
y=x&1;
if(y==0)
{
show_even();
}
else
{
show_odd();
}
}
is essentially what you are trying to do, and you start off in the right direction. But after show_even or show_odd, each path needs to go separately to the end of the function; you don't want the first path to run through the second one as well.
Your even path is going through this code:
evenn:
mov eax, 4;
mov ebx, 1;
mov ecx, even;
mov edx, lene;
int 0x80;
oddd:
mov eax, 4;
mov ebx, 1;
mov ecx, odd;
mov edx, leno;
int 0x80;
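which prints Even then falls into the code that prints Odd. You want to branch to outprog after printing Even. Note also that lene is computed after odd has been defined, so $-even covers both strings; that is why the even case prints Even, Odd and then Odd again. Define lene immediately after even.
Use the C code above as your model, not this: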
void fun ( int x )
{
int y;
y=x&1;
if(y==0)
{
show_even();
}
show_odd();
}
where odd is always printed no matter what.
You can also reduce the three branches to a single instruction:
jz evenn;
jnz oddd;
jmp outprog;
versus
jnz oddd;
With only jnz oddd, the even case simply falls through into evenn. Think through the code execution path.
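To make the execution path easier to follow, here is a minimal C++ sketch that mirrors the branch structure with goto. The function names are made up for illustration, and the strings are returned rather than written with int 0x80. It only models the missing jump, not the lene length issue.

```cpp
#include <cassert>
#include <string>

// Corrected flow: one conditional branch, plus an unconditional jump
// after the "even" path so it cannot fall into the "odd" path.
std::string parity_message(unsigned x) {
    std::string out;
    if ((x & 1u) != 0) goto oddd;   // and ax, 1 / jnz oddd
    out += "Even\n";                // evenn: print "Even"
    goto outprog;                   // jmp outprog (missing in the question)
oddd:
    out += "Odd\n";                 // oddd: print "Odd"
outprog:
    return out;                     // outprog: exit
}

// Flow as written in the question: nothing stops the even path
// from running straight on into the odd path.
std::string parity_message_fallthrough(unsigned x) {
    std::string out;
    if ((x & 1u) != 0) goto oddd;
    out += "Even\n";
oddd:
    out += "Odd\n";
    return out;
}
```

Calling parity_message(4) yields only the even text, while parity_message_fallthrough(4) yields both, which is the behaviour described above.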

Related

why is std::equal much slower than a hand rolled loop for two small std::array?

I was profiling a small piece of code that is part of a larger simulation, and to my surprise, the STL function std::equal is much slower than a simple for loop that compares the two arrays element by element. I wrote a small test case, which I believe to be a fair comparison between the two, and the difference, using g++ 6.1.1 from the Debian archives, is not insignificant. I am comparing two four-element arrays of signed integers. I tested std::equal, operator==, and a small for loop. I didn't use std::chrono for exact timing, but the difference is obvious with time ./a.out.
My question is, given the sample code below, why do operator== and std::equal (which I believe calls operator==) take approximately 40s to complete, while the hand-written loop takes only 8s? I'm using a very recent Intel-based laptop. The for loop is faster at all optimization levels: -O1, -O2, -O3, and -Ofast. I compiled the code with
g++ -std=c++14 -Ofast -march=native -mtune=native
The loop runs a huge number of times, just to make the difference clear to the naked eye. The modulo operators represent a cheap operation on one of the array elements, and serve to keep the compiler from optimizing the comparison out of the loop.
#include<iostream>
#include<algorithm>
#include<array>
#include<cstdint> // int32_t, uint32_t
#include<cstdlib> // EXIT_SUCCESS
using namespace std;
using T = array<int32_t, 4>;
bool
are_equal_manual(const T& L, const T& R)
noexcept {
bool test{ true };
for(uint32_t i{0}; i < 4; ++i) { test = test && (L[i] == R[i]); }
return test;
}
bool
are_equal_alg(const T& L, const T& R)
noexcept {
bool test{ equal(cbegin(L),cend(L),cbegin(R)) };
return test;
}
int main(int argc, char** argv) {
T left{ {0,1,2,3} };
T right{ {0,1,2,3} };
cout << boolalpha << are_equal_manual(left,right) << endl;
cout << boolalpha << are_equal_alg(left,right) << endl;
cout << boolalpha << (left == right) << endl;
bool t{};
const size_t N{ 5000000000 };
for(size_t i{}; i < N; ++i) {
//t = left == right; // SLOW
//t = are_equal_manual(left,right); // FAST
t = are_equal_alg(left,right); // SLOW
left[0] = i % 10;
right[2] = i % 8;
}
cout<< boolalpha << t << endl;
return(EXIT_SUCCESS);
}
Here's the generated assembly of the for loop in main() when the are_equal_manual(left,right) function is used:
.L21:
xor esi, esi
test eax, eax
jne .L20
cmp edx, 2
sete sil
.L20:
mov rax, rcx
movzx esi, sil
mul r8
shr rdx, 3
lea rax, [rdx+rdx*4]
mov edx, ecx
add rax, rax
sub edx, eax
mov eax, edx
mov edx, ecx
add rcx, 1
and edx, 7
cmp rcx, rdi
And here's what's generated when the are_equal_alg(left,right) function is used:
.L20:
lea rsi, [rsp+16]
mov edx, 16
mov rdi, rsp
call memcmp
mov ecx, eax
mov rax, rbx
mov rdi, rbx
mul r12
shr rdx, 3
lea rax, [rdx+rdx*4]
add rax, rax
sub rdi, rax
mov eax, ebx
add rbx, 1
and eax, 7
cmp rbx, rbp
mov DWORD PTR [rsp], edi
mov DWORD PTR [rsp+24], eax
jne .L20
I'm not exactly sure what's happening in the generated code for the first case, but it's clearly not calling memcmp(). It doesn't appear to be comparing the full contents of the arrays at all. The loop still iterates 5000000000 times, but its body has been optimized down to very little work. The loop that uses are_equal_alg(left,right), however, still performs the full comparison through a call to memcmp. Basically, the compiler is able to optimize the manual comparison much better than the std::equal template.
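For reference, here is a small sketch of the three comparison strategies side by side. The helper names are hypothetical, and equal_memcmp is only a hand-written stand-in for the memcmp call visible in the second listing (16 bytes for four 32-bit integers). It shows that the results agree, so the difference discussed above is about code generation only, not about semantics.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using T = std::array<std::int32_t, 4>;

// Element-by-element comparison, as in are_equal_manual.
bool equal_manual(const T& l, const T& r) noexcept {
    bool same = true;
    for (std::size_t i = 0; i < l.size(); ++i) {
        same = same && (l[i] == r[i]);
    }
    return same;
}

// The standard algorithm, as in are_equal_alg.
bool equal_alg(const T& l, const T& r) noexcept {
    return std::equal(l.cbegin(), l.cend(), r.cbegin());
}

// Byte-wise comparison of the underlying storage; valid here because
// int32_t has no padding bits and equal values have equal bytes.
bool equal_memcmp(const T& l, const T& r) noexcept {
    return std::memcmp(l.data(), r.data(),
                       l.size() * sizeof(std::int32_t)) == 0;
}
```

To see what a particular compiler makes of each variant, compile with -S (or use a disassembler) rather than relying on wall-clock time alone.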

Converting this C++ for loop into assembly language [closed]

Basically, what I am trying to do is convert this block of C++ and inline assembly into pure assembly. I am a bit confused about how to convert the for loop and the rest into assembly; if anyone could tell me where I am going wrong, that would be great.
This is the original:
void encrypt_chars (int length, char EKey) // Encryption Function.
{ char temp_char; // Char temporary store.
for (int i = 0; i < length; i++) // Encrypt characters one at a time.
{
temp_char = OChars [i]; // Orignal Chars.
__asm { // Switch to inline assembly.
push eax
push ecx
movzx ecx,temp_char
push ecx
lea eax,EKey
push eax
call encrypt4 // Call the encryption subroutine
add esp, 8
mov temp_char,al
pop ecx
pop eax
}
EChars [i] = temp_char; // Store encrypted char in the encrypted chars array.
}
return;
This is my attempt at converting the C++ parts into assembly, which is what I am struggling with; I would appreciate some pointers:
void encrypt_chars(int lengths, char EKey) // Encryption Function.
{
char temp_char;
__asm {
mov dword ptr[i], 0
jmp encrypt_chars
mov eax, dword ptr[i]
add eax, 1
mov dword ptr[i], eax
mov eax, dword ptr[i]
cmp eax, dword ptr[lengths]
jge encrypt_chars
mov eax, dword ptr[i]
mov cl, byte ptr[eax]
mov byte ptr[temp_char], cl
push eax
push ecx
movzx ecx,temp_char
push ecx
lea eax,EKey
push eax
call encrypt4 // Call the encryption subroutine
add esp, 8
mov temp_char,al
pop ecx
pop eax
mov eax, dword ptr[i]
mov cl, byte ptr[temp_char]
mov byte ptr[eax], cl
}
return;
Here is an assembly example of a for loop (ARM-style syntax):
mov R1, #5 ; This is the limit of the loop
mov R0, #0 ; R0 is the loop index, initialize the loop index variable.
loop:
cmp R0, R1 ; Part of the compare expression in for loop.
bge loop_exit
;
; The statement block for the for loop.
add R0, R0, #1 ; The increment part of the for loop
b loop ; Loop around to the compare part of the loop.
; First statement after the for loop
loop_exit:
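The same shape can be written in C++ with goto, which may help when translating a for loop by hand. This is only an illustrative sketch: sum_below and sum_below_for are made-up names, and the loop body (adding the index to a total) is a placeholder for the real statement block.

```cpp
#include <cassert>

// for (int i = 0; i < limit; ++i) total += i;  spelled out as
// init / compare / body / increment / jump, like the assembly above.
int sum_below(int limit) {
    int total = 0;
    int i = 0;                          // initialize the loop index
loop:
    if (i >= limit) goto loop_exit;     // compare, branch out when done
    total += i;                         // statement block of the loop
    ++i;                                // increment part of the loop
    goto loop;                          // back to the compare
loop_exit:
    return total;                       // first statement after the loop
}

// The ordinary for loop it corresponds to.
int sum_below_for(int limit) {
    int total = 0;
    for (int i = 0; i < limit; ++i) total += i;
    return total;
}
```

Note that the compare comes before the body, so a limit of zero skips the body entirely, exactly as a for loop would.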

Is "for( int k = 5; k--;)" faster than "for( int k = 4; k > -1; --k)"

The question says it all: Is
for( int k = 5; k--;)
faster than
for( int k = 4; k > -1; --k)
and why?
EDIT:
I generated the assembly for debug and release builds in MSVC2012, but (it's my first time analyzing assembly code) I can't really make sense of it. I already added the std::cout to prevent the compiler from removing both loops during release optimization.
Can someone help me understand what the assembly means?
Debug:
; 10 : for( int k = 5; k--;){ std::cout << k; }
mov DWORD PTR _k$2[ebp], 5
$LN5@wmain:
mov eax, DWORD PTR _k$2[ebp]
mov DWORD PTR tv65[ebp], eax
mov ecx, DWORD PTR _k$2[ebp]
sub ecx, 1
mov DWORD PTR _k$2[ebp], ecx
cmp DWORD PTR tv65[ebp], 0
je SHORT $LN4@wmain
mov esi, esp
mov eax, DWORD PTR _k$2[ebp]
push eax
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@H@Z
cmp esi, esp
call __RTC_CheckEsp
jmp SHORT $LN5@wmain
$LN4@wmain:
; 11 :
; 12 : for( int k = 4; k > -1; --k){ std::cout << k; }
mov DWORD PTR _k$1[ebp], 4
jmp SHORT $LN3@wmain
$LN2@wmain:
mov eax, DWORD PTR _k$1[ebp]
sub eax, 1
mov DWORD PTR _k$1[ebp], eax
$LN3@wmain:
cmp DWORD PTR _k$1[ebp], -1
jle SHORT $LN6@wmain
mov esi, esp
mov eax, DWORD PTR _k$1[ebp]
push eax
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@H@Z
cmp esi, esp
call __RTC_CheckEsp
jmp SHORT $LN2@wmain
$LN6@wmain:
Release:
; 10 : for( int k = 5; k--;){ std::cout << k; }
mov esi, 5
$LL5@wmain:
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
dec esi
push esi
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@H@Z
test esi, esi
jne SHORT $LL5@wmain
; 11 :
; 12 : for( int k = 4; k > -1; --k){ std::cout << k; }
mov esi, 4
npad 3
$LL3@wmain:
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
push esi
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@H@Z
dec esi
cmp esi, -1
jg SHORT $LL3@wmain
[ UPDATE: the question has been updated, so this is no longer different ] They do different things: the first one executes the loop for k values 4 down to 0, while the second one loops from 5 down to 1. If, say, the loop body does work related to the magnitude of the number, then they might differ in performance.
Ignoring that, on most CPUs k-- incidentally sets the "zero" flag in the flags register, so no further explicit comparison is needed before deciding whether to exit. Still, an optimiser should realise that and avoid any unnecessary second comparison even with the second loop.
Generic quip: compilers are allowed to do lots of things, and the Standard certainly doesn't say anything about the relative performance of these two implementations. Ultimately the only way to know, if you have reason to care, is to use the same compiler and command line options you want for production, then inspect the generated assembly or machine code and/or measure very carefully. The findings could differ when the executable is deployed on different hardware, or built with a later version of the compiler, with different flags, with a different compiler, etc.
Be careful, the two loops are not equivalent:
for( int k = 5; k--;) cout << k << endl;
prints 4 3 2 1 0. While
for( int k = 5; k > 0; k--) cout << k << endl;
prints 5 4 3 2 1.
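A quick way to check which loops are equivalent is to record the values each one visits. The following sketch collects them into vectors instead of printing; the function names are invented for the example.

```cpp
#include <cassert>
#include <vector>

// for (int k = 5; k--;): the test reads k, then decrements it,
// so the body sees 4, 3, 2, 1, 0.
std::vector<int> visit_post_decrement() {
    std::vector<int> seen;
    for (int k = 5; k--;) seen.push_back(k);
    return seen;
}

// for (int k = 4; k > -1; --k): also 4 down to 0.
std::vector<int> visit_down_to_zero() {
    std::vector<int> seen;
    for (int k = 4; k > -1; --k) seen.push_back(k);
    return seen;
}

// for (int k = 5; k > 0; k--): 5 down to 1, a different sequence.
std::vector<int> visit_five_to_one() {
    std::vector<int> seen;
    for (int k = 5; k > 0; k--) seen.push_back(k);
    return seen;
}
```

The first two functions return the same sequence, so the two loops in the question do the same work; only the third differs.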
From a performance point of view, you can have enough confidence in your compiler. Modern compilers know how to optimize this better than we do, in most cases.
It depends on your compiler. Probably not, but as always, you must profile to be certain. Or you could look at the generated assembly code (e.g. gcc -S) and see if it's any different. Make sure to enable optimization before you test, too!

Inline asm and c array questions

This is homework. I have 3 arrays, v1={5,4,3,2,1}, v2={1,2,3,4,5} and v3={2,3,5,1,4}; the assignment is to change the 1 to 6. Of course, any solution like v1[4]=6, in asm or C, is forbidden. So this was my code:
First Code
void main(){
int myArray[5]={5,4,3,2,1};
__asm {
mov ecx,0 //using ecx as counter
myLoop:
mov eax, myArray[ecx] //moving the content on myArray in position ecx to eax
cmp eax,1 //comparing eax to 1
je is_one //if its equal jump to label is_one
inc ecx //ecx+1
cmp ecx,5 //since all vectors have size 5, comparing if ecx is equal to 5
jne myLoop //if not, repeat
jmp Done //if true, go to label Done
is_one:
mov myArray[ecx],6 //changing the content in myArray position ecx to 6
inc ecx //ecx+1
cmp ecx,5 // ecx=5?
jne myLoop //no? repeat loop
jmp Done //yes? Done
Done:
}
printArray(myArray);
}
This didn't work. I tried many things, like mov eax,6 or mov [eax+ecx],6, and nothing worked until I found this solution:
Many tries later code
void main(){
int myArray[5]={5,4,3,2,1};
__asm {
mov ecx,0 //using ecx as counter
myLoop:
mov eax, myArray[TYPE myArray*ecx] //I don't understand how this works
cmp eax,1 //comparing eax to 1
je is_one //if its equal jump to label is_one
inc ecx //ecx+1
cmp ecx,5 //since all vectors have size 5, comparing if ecx is equal to 5
jne myLoop //if not, repeat
jmp Done //if true, go to label Done
is_one:
mov myArray[TYPE myArray*ecx],6 //Uhh...
inc ecx //ecx+1
cmp ecx,5 // ecx=5?
jne myLoop //no? repeat loop
jmp Done //yes? Done
Done:
}
printArray(myArray);
}
And that works like a charm. But I don't understand how or why MOV array[TYPE array * index], value works (besides TYPE returning the size, as explained in the link), and why the others don't.
Also, since I have to do this for 3 arrays, I tried to copy and paste all the code into changingArray(int myArray[]), declared the 3 arrays in main, and passed them to changingArray, but now it is not changing them. I'm pretty sure that with an array you don't have to pass it with &, but I could be wrong. Still, I can't see why it doesn't change them. So...
Final Code
void changingArray(int myArray[]){
__asm {
mov ecx,0 //using ecx as counter
myLoop:
mov eax, myArray[TYPE myArray*ecx] //I don't understand how this works
cmp eax,1 //comparing eax to 1
je is_one //if its equal jump to label is_one
inc ecx //ecx+1
cmp ecx,5 //since all vectors have size 5, comparing if ecx is equal to 5
jne myLoop //if not, repeat
jmp Done //if true, go to label Done
is_one:
mov myArray[TYPE myArray*ecx],6 //Uhh...
inc ecx //ecx+1
cmp ecx,5 // ecx=5?
jne myLoop //no? repeat loop
jmp Done //yes? Done
Done:
}
printArray(myArray);
}
void main(){
//for some odd reason, they arent changing
int v1[5]={5,4,3,2,1};
int v2[5]={1,2,3,4,5};
int v3[5]={2,3,5,1,4};
changingArray(v1);
changingArray(v2);
changingArray(v3);
}
TL;DR section:
Homework: change the number 1 to 6 in 3 arrays, v1={5,4,3,2,1}, v2={1,2,3,4,5} and v3={2,3,5,1,4}.
1- I don't get why the first code doesn't work, but the "many tries later" code does (the MOV array[TYPE array * index], value instruction).
2- Since I need to do this with 3 arrays, I put all the code in changingArray(int myArray[]) and declared my 3 arrays in main, as shown in the final code. While the "many tries later" code did change the array, this one doesn't. I probably just made a mistake in C and not in asm, but I don't see it.
And sorry for my bad English; it is not my first language.
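mov eax, myArray[TYPE myArray*ecx]
Here the address referred to is (base address of myArray) + sizeof(the type of elements of myArray) * ecx. In assembly language, indexing is done in bytes, so the element index has to be scaled by the element size. With plain myArray[ecx], ecx is taken as a byte offset, and the 4-byte load straddles two elements instead of reading element ecx.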
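The byte-offset arithmetic can be spelled out in C++ as a rough sketch. The function name replace_ones_with_six is hypothetical, and it takes the array as an explicit pointer. As far as I can tell, that is also the relevant difference in the final code: there myArray is a pointer parameter, so the inline assembly would presumably need to load that pointer into a register before indexing through it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Replaces every element equal to 1 with 6, locating each element by
// byte offset (base + sizeof(int) * index), the way the assembly must.
// Returns how many elements were changed.
std::size_t replace_ones_with_six(int* arr, std::size_t count) {
    unsigned char* base = reinterpret_cast<unsigned char*>(arr);
    std::size_t changed = 0;
    for (std::size_t i = 0; i < count; ++i) {
        unsigned char* p = base + sizeof(int) * i;  // TYPE myArray * ecx
        int value;
        std::memcpy(&value, p, sizeof value);       // mov eax, [base + offset]
        if (value == 1) {                           // cmp eax, 1
            const int six = 6;
            std::memcpy(p, &six, sizeof six);       // mov [base + offset], 6
            ++changed;
        }
    }
    return changed;
}
```

Dropping the sizeof(int) factor in the offset reproduces the behaviour of the first attempt: the reads land in the middle of elements and never see the value 1.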

can you suggest me better solutions for this in C++ inline-assembly?

I am learning assembly and have started experimenting with the SSE and MMX registers in the Digital Mars C++ compiler (I find Intel syntax easier to read). I have finished a program that takes var_1 as a value and converts it to the number system given by var_2 (this is 8-bit for now; I will expand it to 32, 64 and 128 bits later). The program does this in two ways:
__asm inlining
Usual C++ way of %(modulo) operator.
Question: Can you tell me a more efficient way to use the xmm0-7 and mm0-7 registers, and how to exchange individual bytes of them with the 8-bit registers such as al and ah?
The usual C++ % (modulo) operator is very slow in comparison with __asm on my computer (Pentium M Centrino, 2.0GHz).
If you can tell me how to get rid of the division instruction in the __asm block, it will be even faster.
When I run the program it gives me:
(for the values: var_1=17,var_2=2,all loops are 200M times)
17 is 10001 in number system 2
__asm(clock)...........: 7250 <------too bad. it is 8-bit calc.
C++(clock).............: 12250 <------not very slow(var_2 is a power of 2)
(for the values: var_1=33,var_2=7,all loops are 200M times)
33 is 45 in number system 7
__asm(clock)..........: 2875 <-------not good. it is 8-bit calc.
C++(clock)............: 6328 <----------------really slow(var_2 is not a power of 2)
The second C++ code (the one with the % operator):
t1=clock();//reference time
for(int i=0;i<200000000;i++)
{
y=x;
counter=0;
while(y>g)
{
var_3[counter]=y%g;
y/=g;
counter++;
}
var_3[counter]=y%g;
}
t2=clock();//final time
__asm code:
__asm // i love assembly in some parts of C++
{
pushf //here does register backup
push eax
push ebx
push ecx
push edx
push edi
mov eax,0h //this will be outer loop counter init to zero
//init of medium-big registers to zero
movd xmm0,eax //cannot set to immediate constant: xmm0=outer loop counter
shufps xmm0,xmm0,0h //this makes all bits zero
movd xmm1,eax
movd xmm2,eax
shufps xmm1,xmm1,0h
shufps xmm2,xmm2,0h
movd xmm2,eax
shufps xmm3,xmm3,0h//could have made pxor xmm3,xmm3(single instruction)
//init complete(xmm0,xmm1,xmm2,xmm3 are zero)
movd xmm1,[var_1] //storing variable_1 to register
movd xmm2,[var_2] //storing var_2 to register
lea ebx,var_3 //calculate var_3 address
movd xmm3,ebx //storing var_3's address to register
for_loop:
mov eax,0h
//this line is index-init to zero(digit array index)
movd edx,xmm2
mov cl,dl //this is the var_1 stored in cl
movd edx,xmm1
mov al,dl //this is the var_2 stored in al
mov edx,0h
dng:
mov ah,00h //preparation for a 8-bit division
div cl //divide
movd ebx,xmm3 //get var_3 address
add ebx,edx //i couldnt find a way to multiply with 4
add ebx,edx //so i added 4 times ^^
add ebx,edx //add
add ebx,edx //last adding
//below, mov [ebx],ah is the only memory accessing instruction
mov [ebx],ah //(8 bit)this line is equivalent to var_3[i]=remainder
inc edx //i++;
cmp al,00h //is division zero?
jne dng //if no, loop again
//here edi register has the number of digits
movd eax,xmm0 //get the outer loop counter from medium-big register
add eax,01h //j++;
movd xmm0,eax //store the new counter to medium-big register
cmp eax,0BEBC200h //is j<(200,000,000) ?
jb for_loop //if yes, go loop again
mov [var_3_size],edx //now we have number of digits too!
//here does registers revert back to old values
pop edi
pop edx
pop ecx
pop ebx
pop eax
popf
}
Whole code:
#include <iostream.h>
#include <cmath>
#include<stdlib.h>
#include<stdio.h>
#include<time.h>
int main()
{
srand(time(0));
clock_t t1=clock();
clock_t t2=clock();
int var_1=17; //number itself
int var_2=2; //number system
int var_3[100]; //digits to be showed(maximum 100 as seen )
int var_3_size=0;//asm block will decide what will the number of digits be
for(int i=0;i<100;i++)
{
var_3[i]=0; //here we initialize digits to zeroes
}
t1=clock();//reference time to take
__asm // i love assembly in some parts of C++
{
pushf //here does register backup
push eax
push ebx
push ecx
push edx
push edi
mov eax,0h //this will be outer loop counter init to zero
//init of medium-big registers to zero
movd xmm0,eax //cannot set to immediate constant: xmm0=outer loop counter
shufps xmm0,xmm0,0h //this makes all bits zero
movd xmm1,eax
movd xmm2,eax
shufps xmm1,xmm1,0h
shufps xmm2,xmm2,0h
movd xmm2,eax
shufps xmm3,xmm3,0h
//init complete(xmm0,xmm1,xmm2,xmm3 are zero)
movd xmm1,[var_1] //storing variable_1 to register
movd xmm2,[var_2] //storing var_2 to register
lea ebx,var_3 //calculate var_3 address
movd xmm3,ebx //storing var_3's address to register
for_loop:
mov eax,0h
//this line is index-init to zero(digit array index)
movd edx,xmm2
mov cl,dl //this is the var_1 stored in cl
movd edx,xmm1
mov al,dl //this is the var_2 stored in al
mov edx,0h
dng:
mov ah,00h //preparation for a 8-bit division
div cl //divide
movd ebx,xmm3 //get var_3 address
add ebx,edx //i couldnt find a way to multiply with 4
add ebx,edx //so i added 4 times ^^
add ebx,edx //add
add ebx,edx //last adding
//below, mov [ebx],ah is the only memory accessing instruction
mov [ebx],ah //(8 bit)this line is equivalent to var_3[i]=remainder
inc edx //i++;
cmp al,00h //is division zero?
jne dng //if no, loop again
//here edi register has the number of digits
movd eax,xmm0 //get the outer loop counter from medium-big register
add eax,01h //j++;
movd xmm0,eax //store the new counter to medium-big register
cmp eax,0BEBC200h //is j<(200,000,000) ?
jb for_loop //if yes, go loop again
mov [var_3_size],edx //now we have number of digits too!
//here does registers revert back to old values
pop edi
pop edx
pop ecx
pop ebx
pop eax
popf
}
t2=clock(); //finish time
printf("\n assembly_inline(clocks): %i for the 200 million calculations",(t2-t1));
printf("\n value %i(in decimal) is: ",var_1);
for(int i=var_3_size-1;i>=0;i--)
{
printf("%i",var_3[i]);
}
printf(" in the number system: %i \n",var_2);
//and: more readable form(end easier)
int counter=var_3_size;
int x=var_1;
int g=var_2;
int y=x;// backup
t1=clock();//reference time
for(int i=0;i<200000000;i++)
{
y=x;
counter=0;
while(y>g)
{
var_3[counter]=y%g;
y/=g;
counter++;
}
var_3[counter]=y%g;
}
t2=clock();//final time
printf("\n C++(clocks): %i for the 200 million calculations",(t2-t1));
printf("\n value %i(in decimal) is: ",x);
for(int i=var_3_size-1;i>=0;i--)
{
printf("%i",var_3[i]);
}
printf(" in the number system: %i \n",g);
return 0;
}
Edit:
This is the 32-bit version:
void get_digits_asm()
{
__asm
{
pushf //couldnt store this in other registers
movd xmm0,eax//storing in xmm registers instead of pushing
movd xmm1,ebx//
movd xmm2,ecx//
movd xmm3,edx//
movd xmm4,edi//end of push backups
mov eax,[variable_x]
mov ebx,[number_system]
mov ecx,0h
mov edi,0h
begin_loop:
mov edx,0h
div ebx
lea edi,digits
mov [edi+ecx*4],edx
add ecx,01h
cmp eax,ebx
ja begin_loop
mov edx,0
div ebx
lea edi,digits
mov [edi+ecx*4],edx
inc ecx
mov [digits_total],ecx
movd edi,xmm4//pop edi
movd edx,xmm3//pop edx
movd ecx,xmm2//pop ecx
movd ebx,xmm1//pop ebx
movd eax,xmm0//pop eax
popf
}
}
The code can be much simpler, of course (modeled after the C++ version; it does not include the pushes and pops, and is not tested):
mov esi,200000000
_bigloop:
mov eax,[y]
mov ebx,[g]
lea edi,var_3
; eax = y
; ebx = g
; edi = var_3
xor ecx,ecx
; ecx = counter
_loop:
xor edx,edx
div ebx
mov [edi+ecx*4],edx
add ecx,1
test eax,eax
jnz _loop
sub esi,1
jnz _bigloop
But I would be surprised if it was faster than the C++ version, and in fact it'll almost certainly be slower if the base is a power of two - all sane compilers know how to turn a division and/or modulo by a power of two into bitshifts and bitwise ands.
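For comparison, here is a small C++ sketch of the same digit loop, together with the shift-and-mask form that applies when the base is a power of two. The function names and the shift parameter are illustrative choices, not taken from the original program. Digits come out least significant first, as in var_3.

```cpp
#include <cassert>
#include <vector>

// General case: one division per digit, mirroring the div ebx loop
// (remainder is the digit, quotient is carried to the next round).
std::vector<unsigned> to_digits(unsigned value, unsigned base) {
    assert(base >= 2);
    std::vector<unsigned> digits;
    do {
        digits.push_back(value % base);
        value /= base;
    } while (value != 0);
    return digits;
}

// Power-of-two case: base == 1 << shift, so the remainder is a mask
// and the quotient is a right shift; no division instruction needed.
std::vector<unsigned> to_digits_pow2(unsigned value, unsigned shift) {
    assert(shift >= 1 && shift < 32);
    const unsigned mask = (1u << shift) - 1u;
    std::vector<unsigned> digits;
    do {
        digits.push_back(value & mask);
        value >>= shift;
    } while (value != 0);
    return digits;
}
```

The do-while form always emits at least one digit and stops when the quotient reaches zero, which matches the test eax,eax / jnz _loop structure above.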
Here's a version that uses an 8-bit division. Similar caveats apply, but now the division could even overflow (if y / g is more than 255).
mov esi,200000000
_bigloop:
mov eax,[y]
mov ebx,[g]
lea edi,var_3
; eax = y
; ebx = g
; edi = var_3
xor ecx,ecx
; ecx = counter
_loop:
div bl
mov [edi+ecx],ah
add ecx,1
and eax,0xFF
jnz _loop
sub esi,1
jnz _bigloop
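The overflow caveat can be checked ahead of time. With an 8-bit divisor, div takes its dividend from ax and has to fit the quotient into al; a quotient above 255, or a zero divisor, raises a divide error. The helper below is a hypothetical sketch of that check, not part of either program.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if "div r8" with this ax and divisor would raise a
// divide error: either the divisor is zero or the quotient does not
// fit in the 8-bit al register.
bool div8_would_fault(std::uint16_t ax, std::uint8_t divisor) {
    if (divisor == 0) return true;
    return (ax / divisor) > 0xFF;
}
```

For example, 17 in base 2 is safe, but 600 in base 2 is not, because the first quotient is 300.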