32-bit C++ to 64-bit C++ using Visual Studio 2010

I recently tried to convert a 32-bit C++ project to 64-bit, but I got stuck on the first attempt. Could you point out any suggestions/checklists/gotchas for converting 32-bit C++ to 64-bit in Visual Studio (along the lines of converting 32-bit Delphi to 64-bit)?
int GetVendorID_0(char *pVendorID, int iLen)
{
#ifdef WIN64 // why is WIN64 not defined after switching to the Active (x64) configuration?
    // what to put here?
#else
    DWORD dwA, dwB, dwC, dwD;
    __asm
    {
        PUSHAD
        MOV EAX, 0
        CPUID        // CPUID(EAX=0)
        MOV dwA, EAX
        MOV dwC, ECX
        MOV dwD, EDX
        MOV dwB, EBX
        POPAD
    }
    memset( pVendorID, 0, iLen);
    memcpy( pVendorID,    &dwB, 4);
    memcpy(&pVendorID[4], &dwD, 4);
    memcpy(&pVendorID[8], &dwC, 4);
    return dwA;
#endif
}

Microsoft's compilers (some of them, anyway) have a flag (/Wp64) to point out at least some common problems where code will probably need modification to work as 64-bit code. Incidentally, the macro the compiler predefines when targeting x64 is _WIN64, with a leading underscore; plain WIN64 only comes from the Windows headers, which is why your #ifdef doesn't fire.
As far as your GetVendorID_0 function goes, I'd use Microsoft's __cpuid intrinsic from <intrin.h>, something like this:
#include <intrin.h>

int GetVendorID_0(char *pVendorID, int iLen) {
    int data[4];                        // EAX, EBX, ECX, EDX after the call
    __cpuid(data, 0);                   // leaf 0: vendor string
    memset(pVendorID, 0, iLen);
    memcpy(pVendorID,     &data[1], 4); // EBX
    memcpy(pVendorID + 4, &data[3], 4); // EDX
    memcpy(pVendorID + 8, &data[2], 4); // ECX
    return data[0];
}
That obviously doesn't replace all instances of inline assembly language. Your choices are fairly simple (though not necessarily easy). One is to find an intrinsic like this to do the job. Another is to move the assembly code into a separate .asm file, assemble it, and link it with your C++ (which means learning the x64 calling convention, since the 64-bit MSVC compiler doesn't support inline assembly at all). The third is to forgo what you're doing now and write the closest equivalent you can in more portable code.
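For the more-portable route, GCC and Clang ship a __get_cpuid helper in <cpuid.h> that pairs naturally with MSVC's __cpuid; a minimal cross-compiler sketch might look like this (the function name and byte order follow the original code, the rest is an assumption about your toolchains):

#include <string.h>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

int GetVendorID_0(char *pVendorID, int iLen)
{
    unsigned int a = 0, b = 0, c = 0, d = 0;
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 0);                    // leaf 0: vendor string
    a = regs[0]; b = regs[1]; c = regs[2]; d = regs[3];
#else
    __get_cpuid(0, &a, &b, &c, &d);      // returns 0 if CPUID is unavailable
#endif
    memset(pVendorID, 0, iLen);
    memcpy(pVendorID,     &b, 4);        // vendor string order is EBX, EDX, ECX
    memcpy(pVendorID + 4, &d, 4);
    memcpy(pVendorID + 8, &c, 4);
    return (int)a;                       // highest supported standard leaf
}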

How can I make a single object larger than 2GB using the new operator?

I'm trying to create a single object larger than 2GB using the new operator.
But if the size of the object is larger than 0x7fffffff, the size of the memory to be allocated comes out wrong.
I think it is done by the compiler, because the generated assembly itself uses a strange allocation size.
I'm using Visual Studio 2015, configuration Release, x64.
Is this a bug in VS2015? Otherwise, I'd like to know why the limitation exists.
The example code is below, together with the generated assembly.
struct chunk1MB
{
    char data[1024 * 1024];
};
class chunk1
{
    chunk1MB data1[1024];
    chunk1MB data2[1023];
    char data[1024 * 1024 - 1];  // sizeof(chunk1) == 0x7fffffff
};
class chunk2
{
    chunk1MB data1[1024];
    chunk1MB data2[1024];        // sizeof(chunk2) == 0x80000000
};
auto* ptr1 = new chunk1;
00007FF668AF1044 mov ecx,7FFFFFFFh
00007FF668AF1049 call operator new (07FF668AF13E4h)
auto* ptr2 = new chunk2;
00007FF668AF104E mov rcx,0FFFFFFFF80000000h // must be 080000000h
00007FF668AF1055 mov rsi,rax
00007FF668AF1058 call operator new (07FF668AF13E4h)
Use a compiler like clang-cl that isn't broken, or that doesn't have intentional signed-32-bit implementation limits on max object size, whichever it is for MSVC. (Could this be affected by a largeaddressaware option?)
Current MSVC (19.33 on Godbolt) has the same bug, although it does seem to handle 2GiB static objects. But not 3GiB static objects: adding another 1GiB member leads to wrong code when accessing a byte more than 2GiB from the object's start (Godbolt: mov BYTE PTR chunk2 static_chunk2-1073741825, 2 - note the negative offset).
GCC targeting Linux makes correct code for the case of a 3GiB object, using mov r64, imm64 to get the absolute address into a register, since a RIP-relative addressing mode isn't usable. (In general you'd need gcc -mcmodel=medium to work correctly when some .data / .bss addresses are linked outside the low 2GiB and/or more than 2GiB away from code.)
MSVC seems to have internally truncated the size to signed 32-bit, and then sign-extended. Note the arg it passes to new: mov rcx, 0FFFFFFFF80000000h instead of mov ecx, 80000000h (which would set RCX = 0000000080000000h by implicit zero-extension when writing a 32-bit register.)
In a function that returns sizeof(chunk2) as a size_t, the value comes out correct, but interestingly the asm source prints the size as negative. That might be innocent: e.g. after realizing that the value fits in a 32-bit zero-extended value, MSVC's asm printing code might just always print 32-bit integers as signed decimal, with the unsigned hex in a comment.
It's clearly different from how the arg is passed to new; in that case it used 64-bit operand-size in the machine code, so the same 32-bit immediate gets sign-extended to 64-bit, to a huge value near SIZE_MAX, which is of course vastly larger than any possible max object size for x86-64. (The 48-bit virtual address space is 1/65536th of the 64-bit value-range of size_t.)
unsigned __int64 sizeof_chunk2(void) PROC        ; sizeof_chunk2, COMDAT
        mov     eax, -2147483648                 ; 80000000H
        ret     0
unsigned __int64 sizeof_chunk2(void) ENDP        ; sizeof_chunk2
This looks like a compiler bug or intentional implementation limit; report it to Microsoft if it's not already known.
I'm not sure how to completely solve your issue, as I haven't seen it properly answered anywhere.
Memory models are tricky, and up until x64, 2GB was pretty much the limit.
No basic memory model in Windows supports larger allocations, as far as I know.
Huge pages support 1GB of memory.
However, I want to point in some different directions.
Three ways I found to achieve something similar:
1. The obvious answer: split your allocation into smaller chunks; it's also more memory efficient (see the sketch after the VirtualAlloc example below).
2. Use a different kind of swap: you can write memory out to files yourself.
3. Use virtual memory directly through the Windows API VirtualAlloc (not sure if it's helpful to you):
const static SIZE_T giga = 1024 * 1024 * 1024;
const static SIZE_T size = 4 * giga;  // 4GB, well past the 2GB limit
// Reserve and commit the pages in one call.
BYTE* ptr = static_cast<BYTE*>(VirtualAlloc(nullptr, size, MEM_COMMIT, PAGE_READWRITE));
if (ptr != nullptr)
{
    // ... use the memory ...
    VirtualFree(ptr, 0, MEM_RELEASE);
}
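And for the first option, a minimal sketch of splitting one huge allocation into 1MB pieces (the chunk1MB type is borrowed from the question; the helper names and the 3GB figure are mine, just for illustration):

#include <vector>
#include <memory>

struct chunk1MB
{
    char data[1024 * 1024];
};

// 3072 separate 1MB allocations stand in for one contiguous 3GB object.
std::vector<std::unique_ptr<chunk1MB>> chunks;

void allocateChunks(size_t totalMB)
{
    chunks.reserve(totalMB);
    for (size_t i = 0; i < totalMB; ++i)
        chunks.push_back(std::make_unique<chunk1MB>());
}

// Read or write byte `offset` of the logical region.
char& byteAt(size_t offset)
{
    return chunks[offset / (1024 * 1024)]->data[offset % (1024 * 1024)];
}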
Best of luck.

Porting to Mac OS X error

I have a cross-platform audio processing app, written using the Qt and PortAudio libraries. I also use the Chaotic-Daw sources for some audio processing functions (the Vibarto effect and soft-knee dynamic range compression). The problem is that I cannot port my app from Windows to Mac OS X, because I get compiler errors for the __asm parts (I use Mac OS X Yosemite and the Qt Creator 3.4.1 IDE):
/Users/admin/My projects/MySound/daw/basics/rosic_NumberManipulations.h:69:
error: expected '(' after 'asm'
{
^
for code like this:
INLINE int floorInt(double x)
{
    const float round_towards_m_i = -0.5f;
    int i;
#ifndef LINUX
    __asm
    { // <========= error indicates that row
        fld x;
        fadd st, st (0);
        fadd round_towards_m_i;
        fistp i;
        sar i, 1;
    }
#else
    i = (int) floor(x);
#endif
    return (i);
}
How can I resolve this problem?
The code was clearly written for Microsoft's Visual C++ compiler, as that is the syntax it uses for inline assembly. It uses the Intel syntax and is rather simplistic, which makes it easy to write but hinders its optimization potential.
Clang and GCC both use a different format for inline assembly. In particular, they use the GNU AT&T syntax. It is more complicated to write, but much more expressive. The compiler error is basically Clang's way of telling you, "I can tell you're trying to write inline assembly, but you've formatted it all wrong!"
Therefore, to make this code compile, you will need to convert the MSVC-style inline assembly into GAS-format inline assembly. It might look like this:
int floorInt(double x)
{
    const float round_towards_m_i = -0.5f;
    int i;
    __asm__("fadd  %[x], %[x]  \n\t"
            "fadds %[adj]      \n\t"
            "fistpl %[i]       \n\t"
            "sarl  $1, %[i]"
            : [i]   "=m" (i)                 // store result in memory (as required by FISTP)
            : [x]   "t"  (x),                // load input onto top of x87 stack (equivalent to FLD)
              [adj] "m"  (round_towards_m_i)
            : "st");
    return (i);
}
But, because of the additional expressivity of the GAS style, we can offload more of the work to the built-in optimizer, which may yield even more optimal object code:
int floorInt(double x)
{
    const float round_towards_m_i = -0.5f;
    int i;
    x += x;                  // equivalent to the first FADD
    x += round_towards_m_i;  // equivalent to the second FADD
    __asm__("fistpl %[i]"
            : [i] "=m" (i)
            : [x] "t"  (x)
            : "st");
    return (i >> 1);         // equivalent to the final SAR
}
(Note that, technically, a signed right-shift like the one on the last line is implementation-defined in C and would normally be inadvisable. However, if you're using inline assembly, you have already made the decision to target a specific platform, and can therefore rely on implementation-specific behavior. In this case, I know (and it can easily be demonstrated) that C compilers for x86 perform an arithmetic right-shift on signed integer values, generating SAR instructions.)
That said, it appears that the authors of the code intended for the inline assembly to be used only when you are compiling for a platform other than LINUX (presumably, that would be Windows, on which they expected you to be using Microsoft's compiler). So you could get the code to compile simply by ensuring that you are defining LINUX, either on the command line or in your makefile.
I'm not sure why that decision was made; Clang and GCC are both going to generate the same inefficient code that MSVC does (assuming that you are targeting the older generation of x86 processors and unable to use SSE2 instructions). It is up to you: the code will run either way, but it will be slower without the use of inline assembly to force the use of this clever optimization.
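If you can assume SSE2 (which every x86-64 CPU provides), the same double/round/halve trick can also be written with intrinsics instead of inline assembly. A minimal sketch, relying, like the x87 version, on the default round-to-nearest rounding mode:

#include <emmintrin.h>  // SSE2

int floorInt(double x)
{
    // 2*x - 0.5, rounded to nearest-even by the conversion, then halved
    // with an arithmetic shift: the same trick as FADD/FADD/FISTP/SAR.
    int i = _mm_cvtsd_si32(_mm_set_sd(x + x - 0.5));
    return i >> 1;
}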

malloc() returning address that I cannot access

I have a C++ program that calls some C routines that are generated by Flex / Bison.
When I target a Windows 8.1 64-bit platform, I hit the following exception at runtime:
Unhandled exception at 0x0007FFFA70F2C39 (libapp.dll) in application.exe: 0xC0000005:
Access violation writing location 0x000000005A818118.
I traced this exception to the following piece of code:
YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
{
    YY_BUFFER_STATE b;

    b = (YY_BUFFER_STATE) yy_flex_alloc( sizeof( struct yy_buffer_state ) );
    if ( ! b )
        YY_FATAL_ERROR( "out of dynamic memory in yy_create_buffer()" );

    b->yy_buf_size = size; // This access is what throws the exception
    // ... (rest of the generated function omitted)
}
For reference, elsewhere in the code (also generated by Flex / Bison), we have:
typedef struct yy_buffer_state *YY_BUFFER_STATE;

struct yy_buffer_state
{
    FILE *yy_input_file;
    char *yy_ch_buf;
    char *yy_buf_pos;
    yy_size_t yy_buf_size;
    // ... other fields omitted,
    // total struct size is 56 bytes
};

static void *yy_flex_alloc( yy_size_t size )
{
    return (void *) malloc( size );
}
I traced back to the malloc call and observed that malloc itself is returning the address 0x000000005A818118. I also checked errno, but it is not set after the call to malloc.
My question is: why does malloc give me an address that I don't have access to, and how can I make it give me a correct address?
Note: I only observe this behavior in Windows 8.1 64-bit. It passes with other 32-bit Windows variants, as well as Windows 7 32-bit.
Compilation information: I am compiling this on a 64-bit Windows 8.1 machine using Visual Studio 2012.
If it helps, here is the disassembled code:
// b = (YY_BUFFER_STATE) yy_flex_alloc( ... )
0007FFFA75E2C12 call yy_flex_alloc (07FFFA75E3070h)
0007FFFA75E2C17 mov qword ptr [b],rax
// if ( ! b ) YY_FATAL_ERROR( ... )
0007FFFA75E2C1C cmp qword ptr [b],0
0007FFFA75E2C22 jne yy_create_buffer+30h (07FFFA75E2C30h)
0007FFFA75E2C24 lea rcx,[yy_chk+58h (07FFFA7646A28h)]
0007FFFA75E2C2B call yy_fatal_error (07FFFA75E3770h)
// b->yy_buf_size = size
0007FFFA75E2C30 mov rax,qword ptr [b]
0007FFFA75E2C35 mov ecx,dword ptr [size]
0007FFFA75E2C39 mov dword ptr [rax+18h],ecx
Thanks!
The real answer is:
When Visual Studio compiles the flex-generated .c source, stdlib.h (where malloc is declared as returning void*) is not included, so the compiler falls back to C's old implicit-declaration rule, under which malloc is assumed to return int. (I think this is kept for some kind of backward compatibility.)
Visual Studio prints:
warning C4013: 'malloc' undefined; assuming extern returning int
sizeof(int) == 4, but pointer values on x64 systems routinely exceed 4 bytes.
So your pointer just gets cut down to its low 4 bytes.
This problem seems to appear only in 64-bit Visual Studio builds, and only in .c files.
So the solution is: just include stdlib.h yourself, or define whatever macro leads the flex-generated source to include stdlib.h.
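One way to do that, assuming you can edit the scanner's .l input file, is a %top block (supported by modern flex), which copies its contents above everything else in the generated .c file; this is a sketch, not from the original post:

%top{
    /* Declare malloc/free properly before any generated code calls them;
       avoids warning C4013 and the x64 pointer truncation it causes. */
    #include <stdlib.h>
}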
Under normal circumstances malloc() will return a pointer to valid, accessible memory or else NULL. So, your symptoms indicate that malloc() is behaving in an unspecified way. I suspect that, at some point earlier, your program wrote outside of its valid memory, thereby corrupting the data structures used internally by malloc().
Examining your process with a run-time memory analysis tool should help you identify the source of the issue. [See this post for suggestions on memory analysis tools for Windows: Is there a good Valgrind substitute for Windows? ]

Converting inline ASM to intrinsic for x64 igraph

I'm compiling the Python extension igraph from source for x64 instead of the x86 build that is available in the distro. I have gotten it all sorted out in VS 2012, and it compiles once I comment out the following in src/math.c:
#ifndef HAVE_LOGBL
long double igraph_logbl(long double x) {
    long double res;
/**#if defined(_MSC_VER)
    __asm { fld [x] }
    __asm { fxtract }
    __asm { fstp st }
    __asm { fistp [res] }
#else
    __asm__ ("fxtract\n\t"
             "fstp %%st" : "=t" (res) : "0" (x));
#endif*/
    return res;
}
#endif
The problem is that I don't know asm well, and certainly not well enough to know whether there are issues going from x86 to x64. From what I can see, it's a short snippet of 4 assembly instructions that have to be converted to x64 intrinsics.
Any pointers? Is going intrinsic the right way, or should it be a subroutine or pure C?
Edit: Link for igraph extension if anyone wanted to see http://igraph.sourceforge.net/download.html
On x64, floating point will generally be performed using the SSE2 instructions, as these are generally a lot faster. Your only problem here is that there is no SSE equivalent of the fxtract op (which generally means the FPU version will be implemented as a compound instruction, and hence be very slow). So implementing this as a plain C function will likely be just as fast on x64.
I'm finding the function a bit hard to read, however, as from what I can tell it calls fxtract and then stores an integer value to the address pointed to by a long double. This means the long double is going to have a 'partially' undefined value in it. As best I can tell, the above assembly shouldn't work ... but it's been a VERY long time since I wrote any x87 code, so I'm probably just rusty.
Anyway, the function appears to be an implementation of logb, which you won't find implemented in MSVC. It can, however, be implemented as follows using the frexpl function:
#include <math.h>

long double igraph_logbl(long double x)
{
    int exp = 0;
    frexpl( x, &exp );             // x = m * 2^exp, with 0.5 <= |m| < 1
    return (long double)(exp - 1); // logb uses 1 <= |m| < 2, so it's one less
}
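A quick sanity check of the frexpl identity (the test values are mine, not from the original post):

#include <assert.h>
#include <math.h>

int main(void)
{
    assert(igraph_logbl(8.0L)  == 3.0L);        // 8    = 1.0 * 2^3
    assert(igraph_logbl(1.0L)  == 0.0L);        // 1    = 1.0 * 2^0
    assert(igraph_logbl(0.75L) == -1.0L);       // 0.75 = 1.5 * 2^-1
    assert(igraph_logbl(8.0L)  == logbl(8.0L)); // agrees with C99 logbl
    return 0;
}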

Using bts assembly instruction with gcc compiler

I want to use the bts and bt x86 assembly instructions to speed up bit operations in my C++ code on the Mac. On Windows, the _bittestandset and _bittest intrinsics work well, and provide significant performance gains. On the Mac, the gcc compiler doesn't seem to support those, so I'm trying to do it directly in assembler instead.
Here's my C++ code (note that 'bit' can be >= 32):
typedef unsigned long LongWord;

#define DivLongWord(w) ((unsigned)w >> 5)
#define ModLongWord(w) ((unsigned)w & (32-1))

inline void SetBit(LongWord array[], const int bit)
{
    array[DivLongWord(bit)] |= 1 << ModLongWord(bit);
}

inline bool TestBit(const LongWord array[], const int bit)
{
    return (array[DivLongWord(bit)] & (1 << ModLongWord(bit))) != 0;
}
The following assembler code works, but is not optimal, as the compiler can't optimize register allocation:
inline void SetBit(LongWord* array, const int bit)
{
    __asm {
        mov eax, bit
        mov ecx, array
        bts [ecx], eax
    }
}
Question: How do I get the compiler to fully optimize around the bts instruction? And how do I replace TestBit by a bt instruction?
BTS (and the other BT* insns) with a memory destination are slow (>10 uops on Intel). You'll probably get faster code from doing the address math to find the right byte and loading it into a register. Then you can do the BT / BTS with a register destination and store the result.
Or maybe shift a 1 to the right position and use OR with a memory destination for SetBit, or AND with a memory source for TestBit. Of course, if you avoid inline asm, the compiler can inline TestBit and use TEST instead of AND, which is useful on some CPUs (since it can macro-fuse into a test-and-branch on more CPUs than AND can).
This is in fact what gcc 5.2 generates from your C source (memory-dest OR or TEST). It looks optimal to me (fewer uops than a memory-dest bt). Actually, note that your code is broken because it assumes unsigned long is 32 bits rather than CHAR_BIT * sizeof(unsigned long). Using uint32_t, or char, would be a much better plan. Note the sign-extension of eax into rax with the cdqe instruction, due to the badly-written C which uses 1 instead of 1UL.
Also note that inline asm can't return the flags as a result (except with a new-in-gcc-v6 extension!), so using inline asm for TestBit would probably result in terrible code, like:
... ; inline asm
bt reg, reg
setc al ; end of inline asm
test al, al ; compiler-generated
jz bit_was_zero
Modern compilers can and do use BT when appropriate (with a register destination). End result: your C probably compiles to faster code than what you're suggesting doing with inline asm, and it would be even faster after being bugfixed to be correct and 64-bit-clean (see the sketch after the next snippet). If you were optimizing for code size, and willing to pay a significant speed penalty, forcing use of bts could work, but bt probably still won't work well (because the result goes into the flags).
inline void SetBit(LongWord *array, int bit) {
    // Note: for bit >= 32 this writes outside *array as far as the compiler knows.
    asm("bts %1, %0" : "+m" (*array) : "r" (bit));
}
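For reference, here is a sketch of the bugfixed, 64-bit-clean C version mentioned above; pinning the word type to uint32_t and shifting an unsigned one are my changes, not the question's:

#include <stdint.h>

typedef uint32_t LongWord;   // always 32 bits, regardless of platform

inline void SetBit(LongWord array[], int bit)
{
    array[bit / 32] |= (LongWord)1 << (bit % 32);
}

inline bool TestBit(const LongWord array[], int bit)
{
    return (array[bit / 32] & ((LongWord)1 << (bit % 32))) != 0;
}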
This version efficiently returns the carry flag (via the gcc-v6 flag-output extension mentioned by Peter in the top answer) for a subsequent test instruction. It only supports a register operand, since using a memory operand is very slow, as he said:
int variable_test_and_set_bit64(unsigned long long &n, const unsigned long long bit) {
    int oldbit;
    asm("bts %2, %0"
        : "+r" (n), "=@ccc" (oldbit)   // "=@ccc" outputs the carry flag
        : "r" (bit));
    return oldbit;
}
Use in code is then like so. The wasSet variable is optimized away, and the produced assembly has bts followed immediately by a jb instruction checking the carry flag.
unsigned long long flags = *(memoryaddress);
unsigned long long bitToTest = someOtherVariable;
int wasSet = variable_test_and_set_bit64(flags, bitToTest);
if (!wasSet) {
    *(memoryaddress) = flags;
}
Although it seems a bit contrived, this does save me several instructions vs the "1ULL << bitToTest" version.
As another, slightly indirect answer: GCC exposes a number of atomic operations starting with version 4.1, for example:
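A sketch of an atomic test-and-set built on __sync_fetch_and_or (one of the GCC 4.1-era __sync builtins; the wrapper name is mine):

#include <stdint.h>

typedef uint32_t LongWord;

// Atomically set the bit and report whether it was already set.
inline bool TestAndSetBit(LongWord array[], int bit)
{
    LongWord mask = (LongWord)1 << (bit % 32);
    LongWord old  = __sync_fetch_and_or(&array[bit / 32], mask);
    return (old & mask) != 0;
}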