I've noticed that my NDS application works a little faster when I replace all the instances of bytes with integers, yet all the examples online use u8/u16 wherever possible. Is there a specific reason why that is?
The main processor the Nintendo DS utilizes is ARM9, a 32-bit processor.
Reference: http://en.wikipedia.org/wiki/ARM9
Typically, a CPU conducts operations at its word size, in this case 32 bits. Depending on your operations, converting bytes up to integers, or vice versa, may put additional strain on the processor. That conversion, and the potential lack of instructions for values other than 32-bit integers, may explain the slowdown.
Complementary to what Daniel Li said, memory access on ARM platforms must be word aligned, i.e. memory fetches must be multiples of 32 bits. Fetching a byte variable from memory means fetching the whole word containing the relevant byte and then performing the bit-wise operations needed to move it into the least significant bits of the processor register.
These extra instructions are emitted automatically by the compiler, given that it knows the actual alignment of your variables.
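As a rough illustration (assuming the libnds-style u8/u32 typedefs; function names here are made up for the example), a loop counter declared as a byte forces the compiler to keep the value wrapped to 8 bits, which on a 32-bit ARM core typically costs an extra masking or extension instruction per iteration compared with a plain 32-bit counter:

    #include <stdint.h>

    typedef uint8_t  u8;   // libnds-style typedefs, assumed for illustration
    typedef uint32_t u32;

    // The u8 counter must wrap at 256, so the compiler generally has to emit
    // an extra AND/UXTB-style instruction each iteration on a 32-bit ARM core.
    int sum_bytes(const u8 *data)
    {
        int sum = 0;
        for (u8 i = 0; i < 200; ++i)   // byte-sized counter: extra work
            sum += data[i];
        return sum;
    }

    // A register-width counter maps directly onto the 32-bit ALU.
    int sum_words(const u32 *data)
    {
        int sum = 0;
        for (u32 i = 0; i < 200; ++i)  // word-sized counter: no masking needed
            sum += data[i];
        return sum;
    }

Whether this matters in practice depends on what the compiler can optimize away, so it is worth checking the generated assembly rather than assuming.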
As we know, an x86 CPU has a 64-bit data bus. My understanding is that the CPU can't access arbitrary addresses: the addresses it can access are integral multiples of the width of its data bus. For performance, variables should start at (be aligned to) these addresses to avoid extra memory accesses. A 32-bit variable aligned to a 4-byte boundary will automatically be aligned to an 8-byte (64-bit) boundary, which matches the x86 64-bit data bus. But why do compilers align 128-bit variables to a 16-byte boundary, and not an 8-byte boundary?
Thanks
Let me make things more specific. Compilers use the length of a variable to align it. For example, if a variable is 256 bits long, the compiler will align it to a 32-byte boundary. I don't think any kind of CPU has a data bus that wide. Furthermore, common DDR memories only transfer 64 bits at a time; cache aside, how could memory fill up a wider CPU data bus? Or only by means of the cache?
One reason is that most SSE2 instructions on x86 require the data to be 128-bit aligned. This design decision would have been made for performance reasons and to avoid overly complex (and hence slow and big) hardware.
There are so many different processor models that I am going to answer this only in theoretical and general terms.
Consider an array of 16-byte objects that starts at an address that is a multiple of eight bytes but not of 16 bytes. Let’s suppose the processor has an eight-byte bus, as indicated in the question, even though some processors do not. Note that at some point in the array, one of the objects must straddle a page boundary: memory mapping commonly works in 4096-byte pages that start on 4096-byte boundaries. With an eight-byte-aligned array, some element of the array will start at byte 4088 of one page and continue up to byte 7 of the next page.
When a program tries to load the 16-byte object that crosses a page boundary, it can no longer do a single virtual-to-physical memory map. It has to do one lookup for the first eight bytes and another lookup for the second eight bytes. If the load/store unit is not designed for this, then the instruction needs special handling. The processor might abort its initial attempt to execute the instruction, divide it into two special microinstructions, and send those back into the instruction queue for execution. This can delay the instruction by many processor cycles.
In addition, as Hans Passant noted, alignment interacts with cache. Each processor has a memory cache, and it is common for cache to be organized into 32-byte or 64-byte “lines”. If you load a 16-byte object that is 16-byte aligned, and the object is in cache, then the cache can supply one cache line that contains the needed data. If you are loading 16-byte objects from an array that is not 16-byte aligned, then some of the objects in the array will straddle two cache lines. When these objects are loaded, two lines must be fetched from the cache. This may take longer. Even if it does not take longer to get two lines, perhaps because the processor is designed to provide two cache lines per cycle, this can interfere with other things that a program is doing. Commonly, a program will load data from multiple places. If the loads are efficient, the processor may be able to perform two at once. But if one of them requires two cache lines instead of the normal one, then it blocks simultaneous execution of other load operations.
Additionally, some instructions explicitly require aligned addresses. The processor might dispatch these instructions more directly, bypassing some of the tests that fix up operations without aligned addresses. When the addresses of these instructions are resolved and are found to be misaligned, the processor must abort them, because the fix-up operations have been bypassed.
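In C++ you can sidestep these straddling cases by requesting 16-byte alignment explicitly. This is only a sketch of the idea; the type name is invented for the example:

    #include <cstdint>

    // A 16-byte object. Without an alignment request it could legally start at
    // any 8-byte boundary and occasionally straddle a cache line or page.
    struct alignas(16) Pair64
    {
        std::uint64_t lo;
        std::uint64_t hi;
    };

    static_assert(alignof(Pair64) == 16, "compiler honours the alignment request");

    // Every element of this array now starts on a 16-byte boundary, so no
    // single element can cross a 4096-byte page boundary (4096 is a multiple of 16).
    Pair64 table[256];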
I have a question about when to use 64-bit integers when targeting 64-bit OSes.
Has anyone done conclusive studies focused on the speed of the generated code?
- Is it better to use 64-bit integers as parameters for functions or methods? (e.g. uint64 myFunc(uint64 myVar)) If we use 64-bit integers as parameters they take more memory, but maybe they are more efficient. And what if we know that some value will always be less than, say, 10: do we still use a 64-bit integer for that parameter?
- Is it better to use 64-bit integers as return types? Is there some penalty for using a 32-bit return value?
- Is it better to use 64-bit integers for loops? (for(size_t i=0; i<...)) In this case, I suppose so. Is there some penalty for using 32-bit loop variables?
- Is it better to use 64-bit integers as indexes for pointers? (e.g. myMemory[index]) In this case, I suppose so. Is there some penalty for using 32-bit indexes?
- Is it better to use 64-bit integers to store data in classes or structs (data we won't want to save to disk or anything like that)?
- Is it better to use 64 bits for a bool type?
- What about conversions between 64-bit integers and floats? Is it better to use doubles now? Until now, doubles have been slower than floats.
- Is there some penalty every time we access a 32-bit variable?
Regards!
I agree with @MarkB but want to provide more detail on some topics.
On x64, there are more registers available (twice as many). The standard calling conventions have therefore been designed to take more parameters in registers by default. So as long as the number of parameters is not excessive (typically 4 or fewer), their types will make no difference. They will be promoted to 64 bit and passed in registers anyway.
Space will be allocated on the stack for those 64-bit registers even though the parameters are passed in registers. This is by design, to keep their storage locations simple and contiguous with those of any surplus parameters. The surplus parameters will be placed on the stack regardless, so size may matter in those cases.
This issue is particularly important for memory data structures. Using 64 bit where 32 bit is sufficient will waste memory, and more importantly, occupy space in cache lines. The cache impact is not simple though. If your data access pattern is sequential, that's when you will pay for it by essentially making half of your cache unusable. (Assuming you only needed half of each 64 bit quantity.)
If your access pattern is random, there is no impact on cache performance. This is because every access occupies a full cache line anyway.
There can be a small impact in accessing integers that are smaller than word size. However, pipelining and multiple issue of instructions will make it so that the extra instruction (zero or sign extend) will almost always become completely hidden and go unobserved.
The upshot of all this is simple: choose the integer size that matters for your problem. For parameters, the compiler can promote them as needed. For memory structure, smaller is typically better.
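As a small sketch of the memory-structure point (the struct names are invented, and the sizes in the comments assume a typical LP64 platform rather than anything guaranteed):

    #include <cstdint>

    // Counters that never exceed a few thousand, stored the "wide" way.
    struct WideRecord
    {
        std::uint64_t id;
        std::uint64_t count;
        std::uint64_t flags;
    };                          // typically 24 bytes

    // The same data with sizes chosen to fit the actual value ranges.
    struct NarrowRecord
    {
        std::uint32_t id;
        std::uint32_t count;
        std::uint32_t flags;
    };                          // typically 12 bytes

    // Twice as many NarrowRecords fit in each 64-byte cache line, which is
    // what matters most for sequential scans over large arrays of them.
    static_assert(sizeof(NarrowRecord) < sizeof(WideRecord),
                  "narrower layout really is smaller");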
You have managed to cram a ton of questions into one question here. It looks to me like all your questions basically concern micro-optimizations. As such I'm going to make a two-part answer:
1. Don't worry about size from a performance perspective. Instead, use types that are indicative of the data they will contain and trust the compiler's optimizer to sort it out.
2. If performance becomes a concern at some point during development, profile your code. Then you can make algorithmic adjustments as appropriate, and if the profiler shows that integer operations are causing a problem you can compare different sizes side by side.
Use int and trust the platform and compiler authors to have done their job and chosen the most efficient representation for it. On most 64-bit platforms int is 32 bits, which means it is no less efficient than the 64-bit types.
Sorry if the question sounds stupid. I'm only vaguely cognizant of the issue of data alignment and have never done any 64-bit programming. I'm working on some 32-bit x86 code right now. It frequently accesses an array of int. Sometimes one 32-bit integer is read; sometimes two or more are read. At some point I'd like to make the code 64-bit. What I'm not sure about is whether I should declare this array as int or long int. I would rather keep the width of the integer the same, so I don't have to worry about differences. I'm somewhat worried, though, that reading/writing at an address that isn't aligned to the natural word size might be slow.
Misalignment penalties only occur when the load or store crosses an alignment boundary. That boundary is usually the smaller of:
- the natural word size of the hardware (32-bit or 64-bit*), and
- the size of the data type.
If you're loading a 4-byte word on a 64-bit (8-byte) architecture, it does not need to be 8-byte aligned; it only needs to be 4-byte aligned.
Likewise, if you're loading a 1-byte char on any machine, it doesn't need to be aligned at all.
*Note that SIMD vectors can imply a larger natural word-size. For example, 16-byte SSE still requires 16-byte alignment on both x86 and x64. (barring explicit misaligned loads/stores)
So in short, no, you don't have to worry about data alignment. The language and the compiler try pretty hard to prevent you from having to worry about it.
So just stick with whatever datatype makes the most sense for you.
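If you want to see what the compiler actually requires, the alignment of each type can be inspected directly. The values in the comments are what a typical x86-64 ABI gives, not a guarantee:

    #include <cstdint>
    #include <iostream>

    int main()
    {
        // On a typical x86-64 ABI these print 4 and 8: an int only needs
        // 4-byte alignment even though the machine word is 8 bytes.
        std::cout << alignof(std::int32_t) << '\n';   // usually 4
        std::cout << alignof(std::int64_t) << '\n';   // usually 8
        std::cout << alignof(char) << '\n';           // always 1
    }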
64-bit x86 CPUs are still heavily optimized for efficient manipulation of 32-bit values. Even on 64-bit operating systems, accessing 32-bit values is at least as fast as accessing 64-bit values. In practice, it will actually be faster because less cache space and memory bandwidth is consumed.
There is a lot of good information available here:
Performance 32 bit vs. 64 bit arithmetic
Even more information at https://superuser.com/questions/56540/32-bit-vs-64-bit-systems, where the answer claims the worst slowdown seen was 5% (from an application perspective, not individual operations).
The short answer is no, you won't take a performance hit.
Whenever you access any memory location an entire cache line is read into L1 cache, and any subsequent access to anything in that line is as fast as possible. Unless your 32-bit access crosses a cache line (which it won't if it's on a 32-bit alignment) it will be as fast as a 64-bit access.
I have come to believe that the optimal size for a boolean variable is the natural word width of the processor, i.e. in C/C++ an int. For modern processors this is normally 32 bits. At the machine level, declaring it as a byte, for example, requires a 32-bit fetch and then a mask.
However, I have seen that a BOOL in iOS is 8 bits. I had assumed that people who used bytes were carrying over ideas left over from 8-bit processors.
I realise this question depends on the use case, and that most of the time the language-defined boolean is the best bet, but there are times when you need to define your own, such as when you are converting code that arrives from an external source, or when you want to write cross-platform code.
It is also significant that if a boolean value is going to be packed into a serial stream, for sending over a serial line such as Ethernet, or for storage, it may be optimal to pack it into fewer bits. But I suspect it is still optimal to pack and unpack from a processor-optimal size.
So my question is: am I correct in thinking that the optimal size for a boolean on a 32-bit processor is 32 bits, and if so, why does iOS use 8 bits?
Yup, you are right, it depends. The big advantage of using an 8-bit bool is that you can pack more of them into a struct nicely.
Of course you'd be best off using flags in such a case.
The big issue, though, is that with a C/C++ bool you don't necessarily know how big it is. This means that you can't make assumptions about a struct's layout (such as when writing it in binary to disk) without the possibility of it breaking on another platform. In such a case using a variable of known size can be very useful, and you may as well use as little space as possible if you are going to dump the structure to disk.
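A minimal sketch of that idea, with invented names, using a fixed-width field and explicit flag masks so the on-disk layout does not depend on the platform's idea of bool:

    #include <cstdint>

    // Flags packed into one byte of known size; safe to write to disk as-is
    // (endianness is not an issue for a single byte).
    enum : std::uint8_t
    {
        kVisible  = 1u << 0,
        kSelected = 1u << 1,
        kDirty    = 1u << 2,
    };

    struct Record
    {
        std::uint32_t id;
        std::uint8_t  flags;   // known size on every platform, unlike bool
    };

    inline bool is_dirty(const Record &r) { return (r.flags & kDirty) != 0; }
    inline void set_dirty(Record &r)      { r.flags |= kDirty; }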
The notion of an 8-bit quantity involving a 32-bit fetch followed by hardware masking is mostly obsolete. In reality, a fetch from memory (on a modern processor) will normally be one L2 cache line (typically around 64-128 bytes). That being the case, essentially every size of item you deal with involves fetching a big chunk of data and then using only some subset of what you fetched (though, assuming your data is more or less contiguous, you will probably use more of that data subsequently).
C++ attempts (not necessarily successfully) to optimize this a bit for you. An individual bool can be anywhere from one byte on up, though on most typical implementations it's either one byte or four bytes. The (much reviled) std::vector<bool> uses some tricks to give a (sort of) vector-like interface while still storing each bool in one bit. In the process it loses the ability to be treated as a generic sequence container, but when you're storing a lot of bools and can live with the restrictions of using it in an array-like manner, it can actually be a lot more useful than many people believe.
When/if you want to retain normal container semantics and don't mind the extra storage space to keep them at their native size, you can use another container (e.g., std::deque<bool>) instead. Especially if you only need to store a small collection of bools, this can often be a superior alternative.
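A short illustration of the trade-off (the sizes in the comments are typical, not guaranteed by the standard):

    #include <vector>
    #include <deque>

    int main()
    {
        // Packed: roughly one bit per element, but operator[] returns a
        // proxy object, so it is not a "real" container of bool.
        std::vector<bool> packed(1000, false);

        // Unpacked: each element is a genuine bool you can take the address
        // of, at the cost of (at least) one byte per element.
        std::deque<bool> plain(1000, false);

        packed[42] = true;
        plain[42]  = true;
        return (packed[42] && plain[42]) ? 0 : 1;
    }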
It is architecture dependent, but on many 32 bit architectures 8 bit addressing is no less efficient than 32 bit; the "fetching and masking" as such is performed in hardware logic.
The optimal size in terms of storage space is of course 1 bit. You might, for example, use bit-fields or bit masking to pack multiple booleans into a single word. Some architectures, such as the 8051, have bit-addressable memory. The more modern ARM Cortex-M architecture employs a technique called bit-banding that allows memory and hardware registers to be bit-addressable.
A char may be one byte in size, but when it comes to a four-byte value such as an int, how does the CPU know to treat it as one integer rather than as four one-byte characters?
The CPU executes the code that you wrote.
This code tells the CPU how to treat the bytes at a certain memory location, e.g. "take the four bytes at address 0x87367, treat them as an integer and add one to the value".
See, it's you who decides how to treat the memory.
Are you asking a question about CPU design?
Each CPU machine instruction is encoded so that the CPU knows how many bits to operate on.
The C++ compiler knows to emit 8-bit instructions for char and 32-bit instructions for int.
In general the CPU by itself knows nothing about the interpretation of values stored at certain memory locations; it's the code that is run (generated, in this case, by the compiler) that is supposed to know that and use the correct CPU instructions to manipulate those values.
To put it another way, the type of a variable is an abstraction of the language that tells the compiler what code to generate to manipulate the memory.
Types do exist, in some sense, at the machine-code level: there are different instructions for working on different types, i.e. different ways of interpreting the raw values stored in memory, but it's up to the executed code to use the instructions that treat the values stored in memory correctly.
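A small sketch of that point: the same four bytes in memory can be read either as four chars or as one int, and it is purely the code you write (and the instructions the compiler generates for it) that decides which interpretation is used:

    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    int main()
    {
        unsigned char bytes[4] = { 0x01, 0x02, 0x03, 0x04 };

        // Interpretation 1: four separate one-byte values.
        for (unsigned char b : bytes)
            std::printf("%u ", static_cast<unsigned>(b));      // 1 2 3 4

        // Interpretation 2: the same storage viewed as one 32-bit integer.
        std::uint32_t word;
        std::memcpy(&word, bytes, sizeof word);                 // well-defined type punning
        std::printf("\n%#x\n", static_cast<unsigned>(word));    // 0x4030201 on a little-endian machine

        return 0;
    }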
The compiler has a table, called the symbol table, so the compiler knows which type every variable has and how it should be treated.
This depends on the architecture. Most systems use IEEE 754 floating-point representation and two's complement for integer values, but it's up to the CPU in question. It knows how to turn those bytes into "values" appropriately.
On the CPU side, this mostly relates to two things: the registers and the instruction set (look at x86 for example).
The register is just a small chunk of memory that is closest to the CPU. Values are put there and used there for doing basic operations.
The instruction set will include a set of fixed names (like EAX, AX, etc.) for addressing the registers, or parts of them. Depending on the name, they can refer to shorter or longer slots (e.g. 8, 16, 32, or 64 bits). Corresponding to those registers, there are operations (like addition, multiplication, etc.) which act on register values of a certain size too. How the CPU actually executes the instructions or even stores the registers is not relevant (it's at the discretion of the manufacturer of the CPU), and it's up to the programmer (or compiler) to use the instruction set correctly.
The CPU itself has no idea what it's doing (it's not "intelligent"); it just performs the operations as they are requested. The compiler is the one that keeps track of the types of the variables and makes sure that the instructions it generates, and that are later executed by the program, correspond to what you have coded (that's what compilation is). But once the program is compiled, the CPU doesn't "keep track" of the types or sizes or anything like that (it would be too expensive to do so). Since compilers are pretty much guaranteed to produce consistent instructions, this is not an issue. Of course, if you wrote your own code in assembly and used mismatched registers and instructions, the CPU still wouldn't care; it would just make your program behave very strangely (and likely crash).
Internally, a CPU may be wired to fetch 32 bits for an integer, which translates into 4 8-bit octets (bytes). The CPU does not regard the fetch as 4 bytes, but rather 32 bits.
The CPU is also internally wired to fetch 8 bits for a character (byte). In many processor architectures, the CPU fetches 32 bits from memory and internally ignores the unused bits (keeping the lowest 8 bits). This simplifies the processor architecture by only requiring fetches of 32 bits.
On efficient platforms, the memory is also accessible in 32-bit quantities. The data path from the memory to the processor is often called the data bus. In this description it would be 32 bits wide.
Other processor architectures can fetch 8 bits for a character. This removes the need for the processor to ignore 3 bytes from a 32-bit fetch.
Some programmers view integers in widths of bytes rather than bits. Thus a 32-bit integer would be thought of as 4 bytes. This can create confusion, especially with byte ordering, a.k.a. endianness. Some processors have the first byte containing the most significant bits (big-endian), while others have the first byte representing the least significant bits (little-endian). This leads to problems when transferring binary data between platforms.
Knowing that a processor's integer can hold 4 bytes and that it fetches 4 bytes at a time, many programmers like to pack 4 characters into an integer to improve performance. Thus the processor would require 1 fetch for 4 characters rather than 4 fetches for 4 characters. This performance improvement may be wasted by the execution time required to pack and unpack the characters from the integer.
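A hedged sketch of that packing idea (the function names are invented; the shift amounts put the first character in the low byte, and a big-endian-minded scheme would reverse them):

    #include <cstdint>

    // Pack four characters into one 32-bit word, first char in the low byte.
    std::uint32_t pack(char a, char b, char c, char d)
    {
        return  static_cast<std::uint32_t>(static_cast<unsigned char>(a))
             | (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 8)
             | (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 16)
             | (static_cast<std::uint32_t>(static_cast<unsigned char>(d)) << 24);
    }

    // Unpack the i-th character (0..3) back out of the word.
    char unpack(std::uint32_t word, unsigned i)
    {
        return static_cast<char>((word >> (8 * i)) & 0xFFu);
    }

The shifts and masks in pack/unpack are exactly the overhead the paragraph above warns may cancel out the saved fetches.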
In summary, I highly suggest you forget about how many bytes make up an integer and any relationship to a quantity of characters or bytes. This concept is only relevant on a few embedded platforms or in a few high-performance applications. Your objective is to deliver correct and robust code within a given duration. Performance and size concerns come at the end of a project, and are only attended to if somebody complains. You will do fine in your career if you concentrate on the range and limitations of an integer rather than how much memory it occupies.