Let LLVM choose int size depend on platform - llvm

I'm writing a compiler and I want my front end has little to do with the platform details, especially the size of the proto-types(int, long, etc).
For now, if I create a int variable, I have to use IntegerType::get(mod->getContext(), 32). By using this, I have to know the platform information and set a 32 or 16.
Since I want my front end has little to do with platform, is there any mechanism to let LLVM choose a size for a type for me?

It's not quite what you want, but LLVM does have a mechanism for asking the target platform about sizes, called DataLayout. Here's some code I use to generate debug info:
const DataLayout & dl = getModule().getDataLayout();
uint sizeInBits = 0;
if(...)
sizeInBits = dl.getStructLayout(getObjectStructType())->getSizeInBits();
...
DataLayout doesn't offer a single size for ints, because the CPU may may have several, as the current members of the Intel 4004 family do.
IIRC all of the current members offer 8-bit, 16-bit, 32-bit and 64-bit ints using more-or-less extended versions of the same registers (AL, AX, EAX, RAX), with the same performance for all operations except reading/writing to memory. But you can ask LLVM which ones of 16/32/48/64 are good and then pick an int size that suits you.

Since "int" is a C/C++ feature, then its size is specified by C/C++ platform ABI. clang knows the necessary sizes / alignments for sure, so you'd need to integrate with C/C++ frontend one way or another (e.g. execute clang once to derive the necessary sizes).

Related

Is it necessary to take care of long data type in program that is written for Windows and Linux?

According to cpp reference in 64 bit systems:
LLP64 or 4/4/8 (int and long are 32-bit, pointer is 64-bit)
Win64 API
LP64 or 4/8/8 (int is 32-bit, long and pointer are 64-bit)
Unix and Unix-like systems (Linux, Mac OS X)
Then how to consider long data type for codes which is written for Linux and Windows?
In C and C++, in portable code, you never know the exact size of a type like int or long int. If you move your code to a different compiler (or a different machine, or a different OS), the sizes of some of your types may change. This needn't be a problem; in fact it's only a problem if you want to make it a problem. (All of this has always been the case, and has nothing to do with someone's definitions of "LLP64" and "LP64" architecture families.)
On those (hopefully rare) occasions when you need a type of an exact size, one good way is to use types like int32_t and uint64_t from <cstdint> (or <stdint.h> in C).
But you really, really shouldn't need to specify the exact size of a type, most of the time. (There are those who say you need to specify the exact size of every type, but my advice is to ignore those people.)
Pretty much the only time you need to specify exact sizes is when trying to define a structure which you can read and write in "binary" fashion to conform to some externally-imposed storage layout. But there, specifying the exact sizes of data types isn't generally sufficient, because of issues like alignment, padding, and byte order. So you're better off writing explicit serialization and deserialization code anyway (or using "text" data formats instead, if you can get away with it).
My bottom line is that I rarely worry about the exact sizes of types.

How to find out if we are using really 48, 56 or 64 bits pointers

I am using tricks to store extra information in pointers, At the moment some bits are not used in pointers(the highest 16 bits), but this will change in the future. I would like to have a way to detect if we are compiling or running on a platform that will use more than 48 bits for pointers.
related things:
Why can't OS use entire 64-bits for addressing? Why only the 48-bits?
http://developer.amd.com/wordpress/media/2012/10/24593_APM_v2.pdf
The solution is needed for x86-64, Windows, C/C++, preferably something that can be done compile-time.
Solutions for other platforms are also of interest but will not marked as correct answer.
Windows has exactly one switch for 32bit and 64bit programs to determine the top of their virtual-address-space:
IMAGE_FILE_LARGE_ADDRESS_AWARE
For both types, omitting it limits the program to the lower 2 GB of address-space, severely reducing the memory an application can map and thus also reducing effectiveness of Address-Space-Layout-Randomization (ASLR, an attack mitigation mechanism).
There is one upside to it though, and just what you seem to want: Only the lower 31 bits of a pointer can be set, so pointers can be safely round-tripped through int (32 bit integer, sign- or zero-extension).
At run-time the situation is slightly better:
Just use the cpuid-instruction from intel, function EAX=80000008H, and read the maximum number of used address bits for virtual addresses from bits 8-15.
The OS cannot use more than the CPU supports, remember that intel insists on canonical addresses (sign-extended).
See here for how to use cpuid from C++: CPUID implementations in C++

How to write convertible code, 32 bit/64 bit?

A c++ specific question. So i read a question about what makes a program 32 bit/64 bit, and the anwser it got was something like this (sorry i cant find the question, was somedays ago i looked at it and i cant find it again:( ): As long as you dont make any "pointer assumptions", you only need to recompile it. So my question is, what are pointer assumtions ? To my understanding there is 32 bit pointer and 64 bit pointers so i figure it is something to do with that . Please show the diffrence in code between them. Any other good habits to keep in mind while writing code, that helps it making it easy to convert between the to are also welcome :) tho please share examples with them
Ps. I know there is this post:
How do you write code that is both 32 bit and 64 bit compatible?
but i tougth it was kind of to generall with no good examples, for new programmers like myself. Like what is a 32 bit storage unit ect. Kinda hopping to break it down a bit more (no pun intended ^^ ) ds.
In general it means that your program behavior should never depend on the sizeof() of any types (that are not made to be of some exact size), neither explicitly nor implicitly (this includes possible struct alignments as well).
Pointers are just a subset of them, and it probably also means that you should not try to rely on being able to convert between unrelated pointer types and/or integers, unless they are specifically made for this (e.g. intptr_t).
In the same way you need to take care of things written to disk, where you should also never rely on the size of e.g. built in types, being the same everywhere.
Whenever you have to (because of e.g. external data formats) use explicitly sized types like uint32_t.
For a well-formed program (that is, a program written according to syntax and semantic rules of C++ with no undefined behaviour), the C++ standard guarantees that your program will have one of a set of observable behaviours. The observable behaviours vary due to unspecified behaviour (including implementation-defined behaviour) within your program. If you avoid unspecified behaviour or resolve it, your program will be guaranteed to have a specific and certain output. If you write your program in this way, you will witness no differences between your program on a 32-bit or 64-bit machine.
A simple (forced) example of a program that will have different possible outputs is as follows:
int main()
{
std::cout << sizeof(void*) << std::endl;
return 0;
}
This program will likely have different output on 32- and 64-bit machines (but not necessarily). The result of sizeof(void*) is implementation-defined. However, it is certainly possible to have a program that contains implementation-defined behaviour but is resolved to be well-defined:
int main()
{
int size = sizeof(void*);
if (size != 4) {
size = 4;
}
std::cout << size << std::endl;
return 0;
}
This program will always print out 4, despite the fact it uses implementation-defined behaviour. This is a silly example because we could have just done int size = 4;, but there are cases when this does appear in writing platform-independent code.
So the rule for writing portable code is: aim to avoid or resolve unspecified behaviour.
Here are some tips for avoiding unspecified behaviour:
Do not assume anything about the size of the fundamental types beyond that which the C++ standard specifies. That is, a char is at least 8 bit, both short and int are at least 16 bits, and so on.
Don't try to do pointer magic (casting between pointer types or storing pointers in integral types).
Don't use a unsigned char* to read the value representation of a non-char object (for serialisation or related tasks).
Avoid reinterpret_cast.
Be careful when performing operations that may over or underflow. Think carefully when doing bit-shift operations.
Be careful when doing arithmetic on pointer types.
Don't use void*.
There are many more occurrences of unspecified or undefined behaviour in the standard. It's well worth looking them up. There are some great articles online that cover some of the more common differences that you'll experience between 32- and 64-bit platforms.
"Pointer assumptions" is when you write code that relies on pointers fitting in other data types, e.g. int copy_of_pointer = ptr; - if int is a 32-bit type, then this code will break on 64-bit machines, because only part of the pointer will be stored.
So long as pointers are only stored in pointer types, it should be no problem at all.
Typically, pointers are the size of the "machine word", so on a 32-bit architecture, 32 bits, and on a 64-bit architecture, all pointers are 64-bit. However, there are SOME architectures where this is not true. I have never worked on such machines myself [other than x86 with it's "far" and "near" pointers - but lets ignore that for now].
Most compilers will tell you when you convert pointers to integers that the pointer doesn't fit into, so if you enable warnings, MOST of the problems will become apparent - fix the warnings, and chances are pretty decent that your code will work straight away.
There will be no difference between 32bit code and 64bit code, the goal of C/C++ and other programming languages are their portability, instead of the assembly language.
The only difference will be the distrib you'll compile your code on, all the work is automatically done by your compiler/linker, so just don't think about that.
But: if you are programming on a 64bit distrib, and you need to use an external library for example SDL, the external library will have to also be compiled in 64bit if you want your code to compile.
One thing to know is that your ELF file will be bigger on a 64bit distrib than on a 32bit one, it's just logic.
What's the point with pointer? when you increment/change a pointer, the compiler will increment your pointer from the size of the pointing type.
The contained type size is defined by your processor's register size/the distrib your working on.
But you just don't have to care about this, the compilation will do everything for you.
Sum: That's why you can't execute a 64bit ELF file on a 32bit distrib.
Typical pitfalls for 32bit/64bit porting are:
The implicit assumption by the programmer that sizeof(void*) == 4 * sizeof(char).
If you're making this assumption and e.g. allocate arrays that way ("I need 20 pointers so I allocate 80 bytes"), your code breaks on 64bit because it'll cause buffer overruns.
The "kitten-killer" , int x = (int)&something; (and the reverse, void* ptr = (void*)some_int). Again an assumption of sizeof(int) == sizeof(void*). This doesn't cause overflows but looses data - the higher 32bit of the pointer, namely.
Both of these issues are of a class called type aliasing (assuming identity / interchangability / equivalence on a binary representation level between two types), and such assumptions are common; like on UN*X, assuming time_t, size_t, off_t being int, or on Windows, HANDLE, void* and long being interchangeable, etc...
Assumptions about data structure / stack space usage (See 5. below as well). In C/C++ code, local variables are allocated on the stack, and the space used there is different between 32bit and 64bit mode due to the point below, and due to the different rules for passing arguments (32bit x86 usually on the stack, 64bit x86 in part in registers). Code that just about gets away with the default stacksize on 32bit might cause stack overflow crashes on 64bit.
This is relatively easy to spot as a cause of the crash but depending on the configurability of the application possibly hard to fix.
Timing differences between 32bit and 64bit code (due to different code sizes / cache footprints, or different memory access characteristics / patterns, or different calling conventions ) might break "calibrations". Say, for (int i = 0; i < 1000000; ++i) sleep(0); is likely going to have different timings for 32bit and 64bit ...
Finally, the ABI (Application Binary Interface). There's usually bigger differences between 64bit and 32bit environments than the size of pointers...
Currently, two main "branches" of 64bit environments exist, IL32P64 (what Win64 uses - int and long are int32_t, only uintptr_t/void* is uint64_t, talking in terms of the sized integers from ) and LP64 (what UN*X uses - int is int32_t, long is int64_t and uintptr_t/void* is uint64_t), but there's the "subdivisions" of different alignment rules as well - some environments assume long, float or double align at their respective sizes, while others assume they align at multiples of four bytes. In 32bit Linux, they align all at four bytes, while in 64bit Linux, float aligns at four, long and double at eight-byte multiples.
The consequence of these rules is that in many cases, bith sizeof(struct { ...}) and the offset of structure/class members are different between 32bit and 64bit environments even if the data type declaration is completely identical.
Beyond impacting array/vector allocations, these issues also affect data in/output e.g. through files - if a 32bit app writes e.g. struct { char a; int b; char c, long d; double e } to a file that the same app recompiled for 64bit reads in, the result will not be quite what's hoped for.
The examples just given are only about language primitives (char, int, long etc.) but of course affect all sorts of platform-dependent / runtime library data types, whether size_t, off_t, time_t, HANDLE, essentially any nontrivial struct/union/class ... - so the space for error here is large,
And then there's the lower-level differences, which come into play e.g. for hand-optimized assembly (SSE/SSE2/...); 32bit and 64bit have different (numbers of) registers, different argument passing rules; all of this affects strongly how such optimizations perform and it's very likely that e.g. SSE2 code which gives best performance in 32bit mode will need to be rewritten / needs to be enhanced to give best performance 64bit mode.
There's also code design constraints which are very different for 32bit and 64bit, particularly around memory allocation / management; an application that's been carefully coded to "maximize the hell out of the mem it can get in 32bit" will have complex logic on how / when to allocate/free memory, memory-mapped file usage, internal caching, etc - much of which will be detrimental in 64bit where you could "simply" take advantage of the huge available address space. Such an app might recompile for 64bit just fine, but perform worse there than some "ancient simple deprecated version" which didn't have all the maximize-32bit peephole optimizations.
So, ultimately, it's also about enhancements / gains, and that's where more work, partly in programming, partly in design/requirements comes in. Even if your app cleanly recompiles both on 32bit and 64bit environments and is verified on both, is it actually benefitting from 64bit ? Are there changes that can/should be done to the code logic to make it do more / run faster in 64bit ? Can you do those changes without breaking 32bit backward compatibility ? Without negative impacts on the 32bit target ? Where will the enhancements be, and how much can you gain ?
For a large commercial project, answers to these questions are often important markers on the roadmap because your starting point is some existing "money maker"...

C++: Datatypes, which to use and when?

I've been told that I should use size_t always when I want 32bit unsigned int, I don't quite understand why, but I think it has something to do with that if someone compiles the program on 16 or 64 bit machines, the unsigned int would become 16 or 64 bit but size_t won't, but why doesn't it? and how can I force the bit sizes to exactly what I want?
So, where is the list of which datatype to use and when? for example, is there a size_t alternative to unsigned short? or for 32bit int? etc. How can I be sure my datatypes have as many bits as I chose at the first place and not need to worry about different bit sizes on other machines?
Mostly I care more about the memory used rather than the marginal speed boost I get from doubling the memory usage, since I have not much RAM. So I want to stop worrying will everything break apart if my program is compiled on a machine that's not 32bit. For now I've used size_t always when i want it to be 32bit, but for short I don't know what to do. Someone help me to clear my head.
On the other hand: If I need 64 bit size variable, can I use it on a 32bit machine successfully? and what is that datatype name (if i want it to be 64bit always) ?
size_t is for storing object sizes. It is of exactly the right size for that and only that purpose - 4 bytes on 32-bit systems and 8 bytes on 64-bit systems. You shouldn't confuse it with unsigned int or any other datatype. It might be equivalent to unsigned int or might be not depending on the implementation (system bitness included).
Once you need to store something other than an object size you shouldn't use size_t and should instead use some other datatype.
As a side note: For containers, to indicate their size, don't use size_t, use container<...>::size_type
boost/cstdint.hpp can be used to be sure integers have right size.
size_t is not not necessarily 32-bit. It has been 16-bit with some compilers. It's 64-bit on a 64-bit system.
The C++ standard guarantees, via reference down to the C standard, that long is at least 32 bits.
int is only formally guaranteed 16 bits, but in practice I wouldn't worry: the chance that any ordinary code will be used on a 16-bit system is slim indeed, and on any 32-bit system int is 32-bit. Of course it's different if you're coding for a 16-bit system like some embedded computer. But in that case you'd probably be writing system-specific code anyway.
Where you need exact sizes you can use <stdint.h> if your compiler supports that header (it was introduced in C99, and the current C++ standard stems from 1998), or alternatively the corresponding Boost library header boost/cstdint.hpp.
However, in general, just use int. ;-)
Cheers & hth.,
size_t is not always 32-bit. E.g. It's 64-bit on 64-bit platforms.
For fixed-size integers, stdint.h is best. But it doesn't come with VS2008 or earlier - you have to download it separately. (It comes as a standard part of VS2010 and most other compilers).
Since you're using VS2008, you can use the MS-specific __int32, unsigned __int32 etc types. Documentation here.
To answer the 64-bit question: Most modern compilers have a 64-bit type, even on 32-bit systems. The compiler will do some magic to make it work. For Microsoft compilers, you can just use the __int64 or unsigned __int64 types.
Unfortunately, one of the quirks of the nature of data types is that it depends a great deal on which compiler you're using. Naturally, if you're only compiling for one target, there is no need to worry - just find out how large the type is using sizeof(...).
If you need to cross-compile, you could ensure compatibility by defining your own typedefs for each target (surrounded #ifdef blocks, referencing which target you're cross-compiling to).
If you're ever concerned that it could be compiled on a system that uses types with even weirder sizes than you have anticipated, you could always assert(sizeof(short)==2) or equivalent, so that you could guarantee at runtime that you're using the correctly sized types.
Your question is tagged visual-studio-2008, so I would recommend looking in the documentation for that compiler for pre-defined data types. Microsoft has a number that are predefined, such as BYTE, DWORD, and LARGE_INTEGER.
Take a look in windef.h winnt.h for more.

Is there a relation between integer and register sizes?

Recently, I was challenged in a recent interview with a string manipulation problem and asked to optimize for performance. I had to use an iterator to move back and forth between TCHAR characters (with UNICODE support - 2bytes each).
Not really thinking of the array length, I made a curial mistake with not using size_t but an int to iterate through. I understand it is not compliant and not secure.
int i, size = _tcslen(str);
for(i=0; i<size; i++){
// code here
}
But, the maximum memory we can allocate is limited. And if there is a relation between int and register sizes, it may be safe to use an integer.
E.g.: Without any virtual mapping tools, we can only map 2^register-size bytes. Since TCHAR is 2 bytes long, half of that number. For any system that has int as 32-bits, this is not going to be a problem even if you dont use an unsigned version of int. People with embedded background used to think of int as 16-bits, but memory size will be restricted on such a device. So I wonder if there is a architectural fine-tuning decision between integer and register sizes.
The C++ standard doesn't specify the size of an int. (It says that sizeof(char) == 1, and sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long).
So there doesn't have to be a relation to register size. A fully conforming C++ implementation could give you 256 byte integers on your PC with 32-bit registers. But it'd be inefficient.
So yes, in practice, the size of the int datatype is generally equal to the size of the CPU's general-purpose registers, since that is by far the most efficient option.
If an int was bigger than a register, then simple arithmetic operations would require more than one instruction, which would be costly. If they were smaller than a register, then loading and storing the values of a register would require the program to mask out the unused bits, to avoid overwriting other data. (That is why the int datatype is typically more efficient than short.)
(Some languages simply require an int to be 32-bit, in which case there is obviously no relation to register size --- other than that 32-bit is chosen because it is a common register size)
Going strictly by the standard, there is no guarantee as to how big/small an int is, much less any relation to the register size. Also, some architectures have different sizes of registers (i.e: not all registers on the CPU are the same size) and memory isn't always accessed using just one register (like DOS with its Segment:Offset addressing).
With all that said, however, in most cases int is the same size as the "regular" registers since it's supposed to be the most commonly used basic type and that's what CPUs are optimized to operate on.
AFAIK, there is no direct link between register size and the size of int.
However, since you know for which platform you're compiling the application, you can define your own type alias with the sizes you need:
Example
#ifdef WIN32 // Types for Win32 target
#define Int16 short
#define Int32 int
// .. etc.
#elif defined // for another target
Then, use the declared aliases.
I am not totally aware, if I understand this correct, since some different problems (memory sizes, allocation, register sizes, performance?) are mixed here.
What I could say is (just taking the headline), that on most actual processors for maximum speed you should use integers that match register size. The reason is, that when using smaller integers, you have the advantage of needing less memory, but for example on the x86 architecture, an additional command for conversion is needed. Also on Intel you have the problem, that accesses to unaligned (mostly on register-sized boundaries) memory will give some penality. Off course, on todays processors things are even more complex, since the CPUs are able to process commands in parallel. So you end up fine tuning for some architecture.
So the best guess -- without knowing the architectore -- speeedwise is, to use register sized ints, as long you can afford the memory.
I don't have a copy of the standard, but my old copy of The C Programming Language says (section 2.2) int refers to "an integer, typically reflecting the natural size of integers on the host machine." My copy of The C++ Programming Language says (section 4.6) "the int type is supposed to be chosen to be the most suitable for holding and manipulating integers on a given computer."
You're not the only person to say "I'll admit that this is technically a flaw, but it's not really exploitable."
There are different kinds of registers with different sizes. What's important are the address registers, not the general purpose ones. If the machine is 64-bit, then the address registers (or some combination of them) must be 64-bits, even if the general-purpose registers are 32-bit. In this case, the compiler may have to do some extra work to actually compute 64-bit addresses using multiple general purpose registers.
If you don't think that hardware manufacturers ever make odd design choices for their registers, then you probably never had to deal with the original 8086 "real mode" addressing.