How to use intrinsics for inline assembly in C++? - c++

I am trying to port a C++ tool to x64 in VS2005. The problem is that the code contains inline assembly, which is not supported by the 64-bit compiler. My question is whether it takes much more effort to rewrite it in plain C++ or to use intrinsics. But in that case not all assembler instructions are available as intrinsics for x64, am I right? Let's say I have a simple program:
#include <stdio.h>

void main()
{
    int a = 5;
    int b = 3;
    int res = 0;
    _asm
    {
        mov eax, a
        add eax, b
        mov res, eax
    }
    printf("%d + %d = %d\n", a, b, res);
}
How must I change this code to use intrinsics (or plain C++) so that it runs? I'm new to assembler and do not know most of its instructions.
UPDATE:
I made changes to compile the assembly with ml64.exe, as Hans suggested.
; add.asm
; ASM function called from C++
.code
;---------------------------------------------
AddInt PROC,
    a:DWORD,    ; receives an integer
    b:DWORD     ; receives an integer
; Returns: sum of a and b, in EAX.
;----------------------------------------------
    mov eax, a
    add eax, b
    ret
AddInt ENDP
END
main.cpp
#include <stdio.h>

extern "C" int AddInt(int a, int b);

void main()
{
    int a = 5;
    int b = 3;
    int res = AddInt(a, b);
    printf("%d + %d = %d\n", a, b, res);
}
but the result is not correct: 5 + 3 = -1717986920. I guess something goes wrong with a pointer. Where did I make a mistake?

Inline assembly isn't supported for 64-bit targets in VC.
Regarding the error in your non-inline code: at first look the code seems fine. I would look at the assembly the C++ compiler generates for the call, to see whether it matches what the AddInt procedure expects.
Edit: two things to note:
Add extern AddInt :proc to your asm code.
I'm not aware of an assembly syntax for a procedure accepting parameters this way. The parameters are normally extracted according to your calling convention, usually via the stack pointer (the sp register); see more here: http://courses.engr.illinois.edu/ece390/books/labmanual/c-prog-mixing.html
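A further note, not from the original answer: in the Microsoft x64 calling convention the first two integer arguments arrive in RCX/RDX rather than on the stack where a 32-bit-style parameter list looks for them, which would explain the garbage result. As for the original question of plain C++ versus intrinsics, here is a minimal sketch assuming MSVC and its <intrin.h> header; the __popcnt call is only an illustration of when an intrinsic is actually needed, not something the posted code requires:
#include <stdio.h>
#include <intrin.h>   // MSVC intrinsics such as __popcnt

int main()
{
    int a = 5;
    int b = 3;
    int res = a + b;   // plain C++; the optimizer emits the same ADD the _asm block did

    printf("%d + %d = %d\n", a, b, res);

    // An intrinsic is only needed when no C++ operator maps to the instruction:
    printf("popcount(255) = %u\n", __popcnt(255u));
    return 0;
}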

Related

I'm trying to link assembly code to C++, but I don't know how it works

I want to link assembly code to C++, and below is my code.
This is .cpp
#include <iostream>
#include <time.h>
#include <cstdlib>   // rand()
using namespace std;

extern "C" { int IndexOf(long searchVal, long array[], unsigned count); }

int main()
{
    // Fill an array with pseudorandom integers.
    const unsigned ARRAY_SIZE = 100000;
    const unsigned LOOP_SIZE = 100000;
    char F[] = "false";
    char T[] = "true";
    char* boolstr[] = { F, T };

    long array[ARRAY_SIZE];
    for (unsigned i = 0; i < ARRAY_SIZE; i++)
        array[i] = rand();

    long searchval;
    time_t startTime, endTime;
    cout << "Enter an integer value to find: ";
    cin >> searchval;
    cout << "Please wait... \n";

    // Test the assembly language function.
    time(&startTime);
    int count = 0;
    for (unsigned n = 0; n < LOOP_SIZE; n++)
        count = IndexOf(searchval, array, ARRAY_SIZE); // Here
    bool found = count != -1;
    time(&endTime);

    cout << "Elapsed ASM time: " << long(endTime - startTime)
         << " seconds. Found = " << boolstr[found] << endl;
    return 0;
}
This is .asm
; IndexOf function (IndexOf.asm)
.586
.model flat, C

IndexOf PROTO,
    srchval:DWORD, arrayPtr:PTR DWORD, count:DWORD

.code
;---------------------------------------------------------------------
IndexOf PROC USES ecx esi edi,
    srchval:DWORD, arrayPtr:PTR DWORD, count:DWORD
;
; Performs a linear search of a 32-bit integer array,
; looking for a specific value. If the value is found,
; the matching index position is returned in EAX;
; otherwise, EAX equals -1.
;---------------------------------------------------------------------
NOT_FOUND = -1

    mov eax, srchval    ; search value
    mov ecx, count      ; array size
    mov esi, arrayPtr   ; pointer to array
    mov edi, 0          ; index

L1:
    cmp [esi+edi*4], eax
    je found
    inc edi
    loop L1

notFound:
    mov eax, NOT_FOUND  ; return -1 in the full EAX register
    jmp short exit

found:
    mov eax, edi

exit:
    ret
IndexOf ENDP
END
It's the same code as in the textbook Assembly Language by Irvine.
I already set the build customizations and checked the box next to masm.
I also set the .asm file's properties, changing the item type to Microsoft Macro Assembler.
I don't know how to link these two files together.
I'm wondering whether the problem is in the .cpp code or the .asm code.
Please help me :(
Thanks!!
I am not giving an exact answer here, and this answer might be a little rough, but in the link below pay attention to the link command from that question; if you can get yourself two .obj files, one for the .cpp and one for the .asm, I think it should work. If you are learning, you could try making an nmake file; I think some assembly programmers do that. I have the Irvine book and I don't remember him mixing C++ and asm.
How to write and combine multiple source files for a project in MASM?
It is a little more complicated than what I am saying here, but basically, to mix the two they have to use the same calling convention. On Windows I think that is __stdcall; basically the program needs to know how to use the stack between the two source files. Side note: most of the declarations in the Win32 API headers boil down to __stdcall.
https://learn.microsoft.com/en-us/cpp/cpp/stdcall?view=msvc-170
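A hedged sketch of what that looks like on the C++ side (the function name is just the one from the question; whether you actually want __stdcall here is a separate decision). The MASM side would have to match, e.g. by using .model flat, stdcall instead of .model flat, C:
// Declaring the calling convention explicitly so both translation units agree.
extern "C" int __stdcall IndexOf(long searchVal, long array[], unsigned count);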
This is not a pro answer, but it might help.
PS: I didn't notice your comment about the IndexOf.h; try getting rid of the { } on the extern "C" line. I think that line is basically what you would find in a header file; header files typically declare functions, but do not define them.
Edit additional info:
For your last comment, if you are asking what I think you are asking, it works like this: the line extern "C" int IndexOf(long searchVal, long array[], unsigned count); is called a function prototype. It tells the compiler that a function with this IndexOf signature exists; it does not, however, mean the function is implemented, nor define what that function does. Your implementation is in asm, but it could just as easily be replaced with C++, as long as the signature is the same. I am guessing one of the errors you got along the way was a linker error about the definition of IndexOf not being found; that just means your C++ is saying "hey, you said we have an IndexOf function, I need it, but I cannot find it." That happens at the linking stage, where the object files for the C++ and the asm are linked to create the exe file. So yes, if you are using VS you can use your debugger to step into your assembly code for IndexOf, or even put a breakpoint in the asm file itself.
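To make the prototype-versus-definition distinction concrete, here is a small hedged illustration; the C++ body is just a stand-in with the same signature, not the code from the question:
// Prototype: promises the compiler and linker that IndexOf exists somewhere.
extern "C" int IndexOf(long searchVal, long array[], unsigned count);

// Definition: could live in the .asm file, or equally well in C++ like this.
// If no definition is linked in, you get an "unresolved external symbol" error.
extern "C" int IndexOf(long searchVal, long array[], unsigned count)
{
    for (unsigned i = 0; i < count; ++i)
        if (array[i] == searchVal)
            return static_cast<int>(i);
    return -1;
}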
Side note: on Windows (and I guess on Linux too), when programming against the API (the functions of Windows) you can also step into assembler code, but it is not really source code; it is just the machine code that has been linked into your exe. Sometimes you can see that in the stack trace; for example, all Windows exes start in ntdll.dll, as it contains the process loader. As Linux is open source, I guess you could step into the OS source if you had it loaded, and MS can step into the code of Windows as they own it. It comes down to debugging symbols; look for the .pdb files (on Windows platforms with MASM/VS) in your project folder after building the project.

Why is cuda pointer memory access slower than global device memory access?

#include <vector_functions.h>
#include <vector_types.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <string>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

__device__ int foo[16];
__device__ int bar[16];

__global__ void go(const int* ptr) {
    printf("device: tid = %d, foo = %p\n", blockIdx.x, foo);
    printf("device: tid = %d, ptr = %p\n", blockIdx.x, ptr);
    int val = threadIdx.x;
    for (int i = 0; i < (1 << 20); i++) {
        bar[blockIdx.x] = val;
        val = (val * 19 + ptr[threadIdx.x]) % (int)(1e9 + 7); // change ptr to foo for experiment
    }
}

int main() {
    int* ptr = nullptr;
    cudaGetSymbolAddress((void**)&ptr, foo);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    go<<<16, 16>>>(ptr);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaDeviceSynchronize();

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.6fms\n", ms);
    return 0;
}
On my GeForce GTX 1080:
Using ptr takes 180 ms, but using foo takes only 36 ms, although ptr and foo point to the exact same address. I thought they should perform at the same speed because both are global memory cached by L2.
I am using Linux and my compilation command is:
nvcc -gencode=arch=compute_61,code=compute_61 -Xptxas -O3 test.cu -o test
Can anybody explain why?
The reason for the difference between the two cases is that when foo is used explicitly, the compiler (ptxas, in this case) knows that foo does not alias bar, and so it can make a specific optimization. When the kernel argument ptr is used instead, the compiler does not know whether this aliasing is occurring, and must assume it might be. This has significant ramifications for device code generation.
As a proof point, recompile your test case with the following kernel prototype:
__global__ void go(const int* __restrict__ ptr) {
and you will see that the time difference disappears. This is informing the compiler that ptr cannot alias any other known location (such as bar) and so this allows similar code generation in both cases. (In the real world, you would/should only use such decoration when you are prepared to make that kind of contract with the compiler.)
Details:
It's important to remember that the device code compiler is an optimizing compiler. Furthermore, the device code compiler is interested primarily in correctness from a single-thread point of view. Multithreaded access to the same location is not in view of this answer, and indeed is not considered by the device code compiler. It is the programmer's responsibility to ensure correctness when multiple threads are accessing the same location.
With that preamble, the primary difference here appears to be one of optimization. With knowledge that foo (or ptr) does not alias bar and considering only a single thread of execution, it is fairly evident that your kernel loop code could be rewritten as:
int val = threadIdx.x;
int ptrval = ptr[threadIdx.x];  // becomes a LDG instruction
for (int i = 0; i < ((1 << 20) - 1); i++) {
    val = (val * 19 + ptrval) % (int)(1e9 + 7);
}
bar[blockIdx.x] = val;          // becomes a STG instruction
A major impact of this optimization is that we go from writing bar many times to just once. With this optimization, the reads of ptr can also be "optimized into a register" (since we now know it is loop-invariant). The net effect being that all global loads and stores in the loop are eliminated. On the other hand, if ptr may or may not alias bar, then we must allow for the possibility, and the above optimization would not hold.
This appears to be roughly what the compiler is doing. In the case where we use foo (or __restrict__), the compiler has arranged (in the sass code) a single global load at the beginning, a single global store at the end, and a partially unrolled loop full of integer arithmetic.
However, when we leave the code as-is/as-posted, the compiler has also partially unrolled the loop, but has sprinkled LDG and STG instructions throughout the partially unrolled loop.
You can observe this yourself using the cuda binary utilities, for example:
cuobjdump -sass test
(for each case)
The device code printf statements don't materially change any of the observations here, so for simplicity of analysis I would just remove those.
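The same aliasing logic is easier to see on the host. Below is a hedged plain-C++ analogy (my own illustration, not from the answer): with an ordinary pointer the compiler must assume that writing out[i] might change *val and therefore reload it every iteration, while with the restrict qualifier (__restrict on MSVC, also accepted by GCC/Clang) it may hoist the load out of the loop:
// Without restrict: the store to out[i] might alias *val, so *val is re-read
// on every iteration.
void accumulate(const int* val, int* out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] += *val;
}

// With restrict: the compiler is told the two pointers never overlap, so it
// can load *val once before the loop.
void accumulate_restrict(const int* __restrict val, int* __restrict out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] += *val;
}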

C/C++ inline assembler with instructions in string variables

So as you know, in C and C++ when using Visual C++ you can have inline assembly instructions such as:
#include <stdio.h>

int main() {
    printf("Hello\n");
    __asm int 3
    printf("this will not be printed.\n");
    return 0;
}
which will place a breakpoint inside the executable. So my question is: is there some kind of function I can use to invoke __asm with a variable such as a char array? I was thinking of something like this:
char instruction[100] = "int 3";
__asm instruction
But that doesn't seem to work, since it gives 'Invalid OP code'. Can you help with this, or is it not possible at all?
Neither C nor C++ is an interpreted language; the compiler generates the int 3 machine instruction at compile time. The compiled program will not recognise the string as an instruction at run time - unless your program is itself an interpreter.
You can of course use a macro:
#define BREAKPOINT __asm int 3

int main()
{
    printf("Hello\n");
    BREAKPOINT;
    printf("this will not be printed.\n");
    return 0;
}
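As a hedged side note tying back to the x64 question at the top: MSVC also provides the __debugbreak() intrinsic (declared in <intrin.h>), which emits the same breakpoint without inline assembly and therefore also works for 64-bit targets where __asm is unavailable:
#include <stdio.h>
#include <intrin.h>

int main()
{
    printf("Hello\n");
    __debugbreak();   // equivalent to __asm int 3, but usable in x64 builds
    printf("this will not be printed.\n");
    return 0;
}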
The code of your program is created by the compiler, during compilation.
You are trying to feed the compiler its input at run time, when the program is already executing. If you want 'on-the-fly' compilation, you will have to write an incremental compiler and linker that can modify the code while executing it.
Note that even if you were successful, many operating systems would block such execution by default, as it violates their security policies. It would be a great way to build viruses, and is therefore typically blocked.
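To make the "machine code is just bytes" point concrete, here is a hedged, Windows-only sketch (my own illustration, not part of the answer above) of what executing run-time-generated instructions actually involves; it only works because the memory is explicitly requested as executable, which is exactly what those protections police:
#include <windows.h>
#include <cstring>
#include <cstdio>

int main()
{
    // x86/x64 machine code bytes: int 3 (0xCC) followed by ret (0xC3).
    unsigned char code[] = { 0xCC, 0xC3 };

    // Ask the OS for memory that is explicitly allowed to be executed.
    void* mem = VirtualAlloc(nullptr, sizeof(code),
                             MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    if (!mem) return 1;

    std::memcpy(mem, code, sizeof(code));

    // Jump into the freshly written bytes: this hits the breakpoint at run time.
    reinterpret_cast<void (*)()>(mem)();

    VirtualFree(mem, 0, MEM_RELEASE);
    return 0;
}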

Do compilers reduce simple functions given constant arguments into unique instructions?

This is something I've always thought to be true but have never had any validation. Consider a very simple function:
int subtractFive(int num) {
    return num - 5;
}
If a call to this function uses a compile-time constant, such as
subtractFive(5);
a compiler with optimizations turned on will very likely inline it. What is unclear to me, however, is whether the num - 5 will be evaluated at run time or at compile time. Will expression simplification extend recursively through inlined functions in this manner? Or does it not transcend function boundaries?
We can simply look at the generated assembly to find out. This code:
int subtractFive(int num) {
    return num - 5;
}

int main(int argc, char *argv[]) {
    return subtractFive(argc);
}
compiled with g++ -O2 yields
leal -5(%rdi), %eax
ret
So the function call was indeed reduced to a single instruction. This optimization technique is known as inlining.
One can of course use the same technique to see how far a compiler will go with that, e.g. the slightly more complicated
int subtractFive(int num) {
    return num - 5;
}

int foo(int i) {
    return subtractFive(i) * 5;
}

int main(int argc, char *argv[]) {
    return foo(argc);
}
still gets compiled to
leal -25(%rdi,%rdi,4), %eax
ret
so here both functions were simply eliminated at compile time. If the input to foo is known at compile time, the function call will (in this case) simply be replaced by the resulting constant at compile time.
The compiler can also combine this inlining with constant folding, to replace the function call with its fully evaluated result if all arguments are compile time constants. For example,
int subtractFive(int num) {
    return num - 5;
}

int foo(int i) {
    return subtractFive(i) * 5;
}

int main() {
    return foo(7);
}
compiles to
mov eax, 10
ret
which is equivalent to
int main() {
    return 10;
}
A compiler will always do this where it thinks it is a good idea, and it is (usually) far better at optimizing code at this low level than you are.
It's easy to do a little test; consider the following
int foo(int);
int bar(int x) { return x-5; }
int baz() { return foo(bar(5)); }
Compiling with g++ -O3 the asm output for function baz is
xorl %edi, %edi
jmp _Z3fooi
This code loads a 0 into the first parameter register and then jumps into the code of foo. So the code from bar has disappeared completely, and the computation of the value to pass to foo has been done at compile time.
In addition, returning the value of the function call became just a jump to the function's code (this is called "tail call optimization").
A smart compiler will evaluate this at compile time and will replace the subtractFive(5) call with its result, because that call can never have a different result: none of the variables involved are volatile.
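A hedged aside that is not in any of the answers above: since C++11 you can also request this behaviour explicitly with constexpr, which guarantees compile-time evaluation in constant contexts rather than relying on the optimizer:
// constexpr makes compile-time evaluation part of the contract, not an optimization.
constexpr int subtractFive(int num) { return num - 5; }

// Evaluated entirely by the compiler; compilation fails if it could not be.
static_assert(subtractFive(5) == 0, "computed at compile time");

int main() { return subtractFive(5); }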

Confused about the function return value

#include <iostream>
using namespace std;

int Fun(int x)
{
    int sum = 1;
    if (x > 1)
        sum = x * Fun(x - 1);
    else
        return sum;
}

int main()
{
    cout << Fun(1) << endl;
    cout << Fun(2) << endl;
    cout << Fun(3) << endl;
    cout << Fun(4) << endl;
    cout << Fun(5) << endl;
}
This function is meant to compute the factorial of an integer. In the x > 1 branch there is no return statement for Fun, so the function should not return the correct answer.
But when Fun(4) and the other calls are tested, the right answers are obtained unexpectedly. Why?
The assembly code of this function (for the call Fun(4)) is:
0x004017E5 push %ebp
0x004017E6 mov %esp,%ebp
0x004017E8 sub $0x28,%esp
0x004017EB movl $0x1,-0xc(%ebp)
0x004017F2 cmpl $0x1,0x8(%ebp)
0x004017F6 jle 0x40180d <Fun(int)+40>
0x004017F8 mov 0x8(%ebp),%eax
0x004017FB dec %eax
0x004017FC mov %eax,(%esp)
0x004017FF call 0x4017e5 <Fun(int)>
0x00401804 imul 0x8(%ebp),%eax
0x00401808 mov %eax,-0xc(%ebp)
0x0040180B jmp 0x401810 <Fun(int)+43>
0x0040180D mov -0xc(%ebp),%eax
0x00401810 leave
0x00401811 ret
Maybe this is the reason: the value of sum is saved in register eax, and the return value is expected in eax too, so Fun returns the correct result.
Usually the EAX register is used to store the return value, and it is also used for other work as well.
So whatever has been loaded into that register just before the function returns will be the return value, even if you didn't intend it to be.
You can use the -S option to generate the assembly code and see what happens to EAX right before the ret instruction.
When your program takes the if branch, no return statement finishes the function. The number you get is the result of undefined behavior.
int Fun(int x)
{
    int sum = 1.0;
    if (x > 1)
        sum = x * Fun(x - 1);
    else
        return sum;
    return x; // return something here
}
Just remove the else from your code:
int Fun(int x)
{
    int sum = 1;
    if (x > 1)
        sum = x * Fun(x - 1);
    return sum;
}
The code you have has a couple of problems:
You have an int being assigned the value 1.0 (which will be implicitly converted); not an error as such, but inelegant.
You have the return statement inside a conditional, so you will only ever get a return when that if condition is false.
If you fix the return issue by removing the else, then all will be fine :)
As to why it works with 4 as an input, that is down to chance / some property of your environment, because the code you posted should not work: when calculating the factorial of any int greater than 1 there will always be a call where x > 1 and no return is executed.
As an aside, here is a more concise/terse version: for so straightforward a function you might consider the ternary operator and use something like:
int factorial(int x) { return (x > 1) ? (x * factorial(x - 1)) : 1; }
This is the function I use for my factorials and have had in my library for the last 30 or so years (since my C days) :)
From the C++ standard:
Flowing off the end of a function is equivalent to a return with no value; this results in undefined behavior in a value-returning function.
Your situation is the same as this one:
#include <stdio.h>

int fun1(int x)
{
    int sum = 1;
    if (x > 1)
        sum++;
    else
        return sum;
}

int main()
{
    int b = fun1(3);
    printf("%d\n", b);
    return 0;
}
It prints 2 on my machine.
This is calling-convention and architecture dependent: the returned value is whatever the last expression evaluation left in the eax register.
As stated in the comment, this is undefined behaviour. With g++ I get the following warning:
warning: control reaches end of non-void function [-Wreturn-type]
In Visual C++, the warning is promoted to an error by default:
error C4716: 'Fun' : must return a value
When I disabled the warning and ran the resulting executable, Fun(4) gave me 1861810763.
So why might it work under g++? During compilation, conditional statements are turned into tests and jumps (or gotos). The function has to return something, and the simplest possible code for the compiler to produce is along the following lines:
int Fun(int x)
{
    int sum = 1.0;
    if (!(x > 1))
        goto RETURN;
    sum = x * Fun(x - 1);
RETURN:
    return sum;
}
This is consistent with your disassembly.
Of course you can't rely on undefined behaviour, as illustrated by the behaviour in Visual C++. Many shops have a policy of treating warnings as errors for this reason (as also suggested in a comment).