Why is memory alignment required? [duplicate] - c++

Possible Duplicate:
Purpose of memory alignment
I read some articles on the net about memory alignment and understood that from properly aligned memory (take 2-byte alignment) we can fetch data quickly in one go.
But if memory is a single piece of hardware, then given an address, why can't we read 2 bytes directly from that position, as in my pictures?
I thought it over. I think that if memory is organized into odd and even banks, then the theory would apply.
What am I missing?

Your pictures describe how we (humans) visualize computer memory.
In reality, think of memory as a huge matrix of bits.
Each matrix column has a "reader" attached that can read/write any bit in that column.
Each matrix row has a "selector", which selects the specific bit that the reader will read/write.
Therefore, the readers together can read a whole selected matrix row at once.
The length of a row (the number of matrix columns) defines how much data can be read at once.
For instance, if you have 64 columns, your memory controller can read 8 bytes at once (it can usually do more than that, though).
As long as you keep your data aligned, you will need fewer of these memory accesses.
Even if you need to read just two bits, if they are located on different rows you will need two memory accesses instead of one.
Also, there's a whole aspect of writing, which is a different problem.
Just as you can read a whole row, you can also write a whole row.
If your data isn't aligned, then when you write something that is not a full row, you will need to do a read-modify-write (read the old contents of the row, modify the relevant part, and write the new contents back).
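To make the row idea concrete, here is a tiny sketch (not tied to any real memory chip) that counts how many 8-byte-wide rows a read of a given size at a given address would touch; aligned reads touch fewer rows:

#include <cstdint>
#include <cstdio>

// Number of row-sized chunks spanned by a read of `size` bytes at `addr`.
std::uint64_t rows_touched(std::uint64_t addr, std::uint64_t size, std::uint64_t row = 8) {
    std::uint64_t first = addr / row;               // row holding the first byte
    std::uint64_t last  = (addr + size - 1) / row;  // row holding the last byte
    return last - first + 1;
}

int main() {
    std::printf("%llu\n", (unsigned long long)rows_touched(0, 8));  // aligned 8-byte read: 1 row
    std::printf("%llu\n", (unsigned long long)rows_touched(5, 8));  // misaligned 8-byte read: 2 rows
}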

Data from memory is typically delivered to the processor on a set of wires that matches the bus width. E.g., if the bus is 32 bits wide, there are 32 data wires going from the bus into the processor (along with other wires for control signals).
Inside the processor, various wires and switches deliver this data to wherever it is needed. If you read 32 aligned bits into a register, the wires can deliver the data very directly to a register (or other holding location).
If you read 8 or 16 aligned bits into a register, the wires can deliver the data the same way, and the other bits in the register are set to zero.
If you read 8 or 16 unaligned bits into a register, the wires cannot deliver the data directly. Instead, the bits must be shifted: They must go through a different set of wires, so that they can be “moved over” to line up with the wires going into the register.
In some processors, the designers have added extra wires and switches to do this moving. This can be very expensive in terms of the amount of silicon it takes: you need a lot of extra wires and switches to be able to move any possible unaligned bytes to the desired locations. Because this is so expensive, some processors do not have a full shifter that can do all shifts immediately. Instead, the shifter might be able to move bits only by a byte or so per CPU cycle, and it takes several cycles to shift by several bytes. In some processors, there are no wires for this at all, so all loads and stores must be aligned.
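As a rough illustration of what that shifting accomplishes, here is a conceptual sketch of how a 32-bit value at an unaligned address can be assembled from two aligned loads plus shifts. It assumes a little-endian machine with 4-byte-aligned word loads; it is not portable C++ (the pointer aliasing and the possible read past the end of the buffer are deliberate simplifications), just a picture of the data movement.

#include <cstdint>

// Conceptual only: assemble an unaligned 32-bit little-endian value from the
// two aligned words that contain it, the way extra shifting hardware (or an
// alignment trap handler) would.
std::uint32_t load_unaligned32(const std::uint8_t* p) {
    std::uintptr_t addr    = reinterpret_cast<std::uintptr_t>(p);
    std::uintptr_t aligned = addr & ~std::uintptr_t(3);   // round down to a 4-byte boundary
    unsigned offset        = static_cast<unsigned>(addr & 3);

    const std::uint32_t* base = reinterpret_cast<const std::uint32_t*>(aligned);
    std::uint32_t lo = base[0];            // aligned word holding the first bytes
    if (offset == 0)
        return lo;                         // already aligned: one load suffices
    std::uint32_t hi = base[1];            // aligned word holding the remaining bytes

    // Shift the wanted bytes of each word into place and merge them.
    return (lo >> (offset * 8)) | (hi << ((4 - offset) * 8));
}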

In the first case (a single piece of hardware), if you need to read 2 bytes then the processor has to issue two read cycles, because memory is byte-addressable, i.e. each byte has its own unique address.
Organizing memory into banks helps the CPU fetch more data into registers in a single read cycle. This reduces the number of read cycles, which are very slow compared to the CPU's processing speed. Thus, in a single read cycle you can read a larger amount of data.

Related

Why should buffers be aligned on a 64-byte boundary for best performance?

In this example program I've found this note:
/* Hardware delivers at most ef_vi_receive_buffer_len() bytes to each
* buffer (default 1792), and for best performance buffers should be
* aligned on a 64-byte boundary. Also, RX DMA will not cross a 4K
* boundary. The I/O address space may be discontiguous at 4K boundaries.
* So easiest thing to do is to make buffers always be 2K in size.
*/
#define PKT_BUF_SIZE 2048
I'm interested in why, for best performance, buffers should be aligned on a 64-byte boundary. Why, for example, are 2000-byte buffers slower than 2048-byte buffers? I guess this is just how a 64-bit computer works: for some reason it's faster to memcpy 2048 bytes than 2000 bytes?
Why exactly are 2048-byte buffers faster, and could you maybe link a "minimal example" where "bigger but 64-byte aligned" buffers are faster?
64 bytes is a popular size of a cache line on contemporary architectures. Any fetch from memory fetches entire cache lines. By aligning data to the cache line boundaries, you minimize the number of cache lines that need to be fetched to read your data and that are dirtied when you write your data.
Of course the size of your data is important, too. For example, if the size of the data divides the size of the cache line, it's perfectly fine to align only on the size.
Suppose, on the other hand, that your data is 96 bytes large. If you align on 32 bytes, you may use up to three cache lines:
|............DDDD|DDDDDDDDDDDDDDDD|DDDD............|
By contrast, if you align on 64 bytes (necessitating another 32 bytes of padding), you only ever need two cache lines:
|................|DDDDDDDDDDDDDDDD|DDDDDDDDPPPPPPPP|
(D = data, P = padding, each character represents 4 bytes.)
Cache lines are even more of a concern when you modify memory concurrently. Every time you dirty one cache line, all other CPUs that have fetched the same cache line may potentially have to discard and refetch those. Accidentally placing unrelated, shared data on the same cache line is known as "false sharing", and the insertion of padding is usually used to avoid that.
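As an aside on the false-sharing point, here is a minimal sketch of the usual padding trick, assuming a 64-byte cache line (C++17's std::hardware_destructive_interference_size is the portable way to obtain this where it is implemented):

#include <atomic>
#include <thread>
#include <vector>

// Each counter is aligned to its own (assumed 64-byte) cache line so that
// threads incrementing different counters do not invalidate each other's lines.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

int main() {
    PaddedCounter counters[4];
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([&counters, i] {
            for (int n = 0; n < 1000000; ++n)
                counters[i].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : workers)
        t.join();
}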
The short answer is that a data cache line on most contemporary x64 processors is 64 bytes wide, so every fetch that a CPU does from main memory is 64 bytes at a time. If you're loading a 64-byte struct that straddles the 64-byte boundary, then the CPU has to fetch two cache lines to get the whole struct.
The real answer is that this too complex a topic to fit into an answer box, but Ulrich Drepper's excellent "What Every Programmer Should Know About Memory" paper will give you a complete explanation.
Also note that the 64-byte figure isn't a basic law of computing, nor is it related to 64-bit processors. It just happens to be the most common cache line size on the x64 processors in most workstations today. Other processors have different cache line sizes (for example, the Xenon PowerPC in the Xbox 360 and the Cell in the PS3 have 128-byte cache lines).

Why should 128-bit variables be aligned to a 16-byte boundary?

As we know, the x86 CPU has a 64-bit data bus. My understanding is that the CPU can't access arbitrary addresses; the addresses it can access are integral multiples of the width of its data bus. For performance, variables should start at (be aligned to) these addresses to avoid extra memory accesses. 32-bit variables aligned to a 4-byte boundary will automatically be aligned to an 8-byte (64-bit) boundary, which corresponds to the x86 64-bit data bus. But why do compilers align 128-bit variables to a 16-byte boundary and not an 8-byte boundary?
Thanks
Let me make things more specific. Compilers use the length of a variable to align it. For example, if a variable is 256 bits long, the compiler will align it to a 32-byte boundary. I don't think any CPU has a data bus that wide. Furthermore, common DDR memories only transfer 64 bits at a time; leaving the cache aside, how could memory fill up a CPU's wider data bus? Or only by means of the cache?
One reason is that most SSE2 instructions on x86 require the data to be 128-bit aligned. This design decision was made for performance reasons and to avoid overly complex (and hence slow and big) hardware.
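For a concrete example of that requirement: with SSE2 intrinsics, _mm_load_si128 expects a 16-byte-aligned address, while _mm_loadu_si128 accepts any address, historically at a performance cost. A minimal sketch, assuming an x86 compiler with <emmintrin.h>:

#include <emmintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    alignas(16) std::int32_t data[4] = {1, 2, 3, 4};    // 16-byte-aligned storage

    // Aligned load: the address must be a multiple of 16, or the CPU faults.
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(data));

    alignas(16) std::int32_t out[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(v, v));   // double each lane
    std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);            // prints: 2 4 6 8

    // Unaligned load: works for any address, typically slower on older chips.
    __m128i u = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data));
    (void)u;
}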
There are so many different processor models that I am going to answer this only in theoretical and general terms.
Consider an array of 16-byte objects that starts at an address that is a multiple of eight bytes but not of 16 bytes. Let's suppose the processor has an eight-byte bus, as indicated in the question, even though some processors do not. Note that at some point in the array, one of the objects must straddle a page boundary: memory mapping commonly works in 4096-byte pages that start on 4096-byte boundaries. With an eight-byte-aligned array, some element of the array will start at byte 4088 of one page and continue up to byte 7 of the next page.
When a program tries to load the 16-byte object that crosses a page boundary, it can no longer do a single virtual-to-physical memory map. It has to do one lookup for the first eight bytes and another lookup for the second eight bytes. If the load/store unit is not designed for this, then the instruction needs special handling. The processor might abort its initial attempt to execute the instruction, divide it into two special microinstructions, and send those back into the instruction queue for execution. This can delay the instruction by many processor cycles.
In addition, as Hans Passant noted, alignment interacts with cache. Each processor has a memory cache, and it is common for cache to be organized into 32-byte or 64-byte “lines”. If you load a 16-byte object that is 16-byte aligned, and the object is in cache, then the cache can supply one cache line that contains the needed data. If you are loading 16-byte objects from an array that is not 16-byte aligned, then some of the objects in the array will straddle two cache lines. When these objects are loaded, two lines must be fetched from the cache. This may take longer. Even if it does not take longer to get two lines, perhaps because the processor is designed to provide two cache lines per cycle, this can interfere with other things that a program is doing. Commonly, a program will load data from multiple places. If the loads are efficient, the processor may be able to perform two at once. But if one of them requires two cache lines instead of the normal one, then it blocks simultaneous execution of other load operations.
Additionally, some instructions explicitly require aligned addresses. The processor might dispatch these instructions more directly, bypassing some of the tests that fix up operations without aligned addresses. When the addresses of these instructions are resolved and are found to be misaligned, the processor must abort them, because the fix-up operations have been bypassed.
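If you want to check for the straddling described above in your own code, a small helper like the following does the arithmetic (a sketch; the boundary value is whatever your platform uses, e.g. 64 for a cache line or 4096 for a page):

#include <cstdint>

// Does an object of `size` bytes at `addr` cross a boundary of width `boundary`?
bool straddles(std::uint64_t addr, std::uint64_t size, std::uint64_t boundary) {
    return (addr / boundary) != ((addr + size - 1) / boundary);
}

// straddles(4088, 16, 4096) == true:  bytes 4088..4103 span two pages.
// straddles(4080, 16, 4096) == false: the object fits inside one page.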

Why are integers processed faster than bytes on the NDS?

I've noticed that my NDS application works a little faster when I replace all instances of bytes with integers. All the examples online use u8/u16 variables whenever possible. Is there a specific reason why this is the case?
The main processor the Nintendo DS uses is an ARM9, a 32-bit processor.
Reference: http://en.wikipedia.org/wiki/ARM9
Typically, a CPU will conduct operations in its word size, in this case 32 bits. Depending on your operations, having to convert the bytes up to integers or vice versa may be putting additional strain on the processor. This conversion, and the potential lack of instructions for values other than 32-bit integers, may be the cause of the lack of speed.
Complementary to what Daniel Li said, memory access on ARM platforms must be word-aligned, i.e. memory fetches must be multiples of 32 bits. Fetching a byte variable from memory implies fetching the whole word containing the relevant byte and performing the bit-wise operations needed to fit it into the least significant bits of a processor register.
These extra instructions are automatically emitted by the compiler, given that it knows the actual alignment of your variables.
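As a rough illustration of the extra work (whether the compiler really emits it depends on the code and the optimization level): keeping a running value in a u8 forces a truncation back to 8 bits after each operation, while a 32-bit accumulator matches the register width directly.

#include <cstdint>

int sum_bytes(const std::uint8_t* p, int n) {
    std::uint8_t acc = 0;                            // 8-bit accumulator: the compiler must
    for (int i = 0; i < n; ++i)                      // truncate back to 8 bits after each add
        acc = static_cast<std::uint8_t>(acc + p[i]); // (e.g. UXTB/AND on ARM)
    return acc;
}

int sum_words(const std::uint8_t* p, int n) {
    std::uint32_t acc = 0;                           // 32-bit accumulator: works directly in
    for (int i = 0; i < n; ++i)                      // full-width registers
        acc += p[i];
    return static_cast<int>(acc);
}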

What would be an ideal buffer size? [duplicate]

Possible Duplicate:
How do you determine the ideal buffer size when using FileInputStream?
When reading raw data from a file (or any input stream) using either C++'s istream family's read() or C's fread(), a buffer has to be supplied, along with a count of how much data to read. Most programs I have seen seem to arbitrarily choose a power of 2 between 512 and 4096.
Is there a reason it has to/should be a power of 2, or is this just programmers' natural inclination towards powers of 2?
What would be the "ideal" number? By "ideal" I mean the fastest. I assume it would have to be a multiple of the underlying device's buffer size? Or maybe of the underlying stream object's buffer? How would I determine the size of those buffers, anyway? And once I do, would using a multiple of it give any speed increase over just using the exact size?
EDIT
Most answers seem to be that it can't be determined at compile time. I am fine with finding it at runtime.
SOURCE:
How do you determine the ideal buffer size when using FileInputStream?
Optimum buffer size is related to a number of things: file system block size, CPU cache size and cache latency.
Most file systems are configured to use block sizes of 4096 or 8192. In theory, if you configure your buffer size so you are reading a few bytes more than the disk block, the operations with the file system can be extremely inefficient (i.e. if you configured your buffer to read 4100 bytes at a time, each read would require 2 block reads by the file system). If the blocks are already in cache, then you wind up paying the price of RAM -> L3/L2 cache latency. If you are unlucky and the blocks are not in cache yet, then you pay the price of the disk -> RAM latency as well.
This is why you see most buffers sized as a power of 2, and generally larger than (or equal to) the disk block size. This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads.
Ensuring this also typically gives you other performance-friendly properties that affect both reading and subsequent processing: data bus width alignment, DMA alignment, memory cache line alignment, and a whole number of virtual memory pages.
At least in my case, the assumption is that the underlying system is using a buffer whose size is a power of two, too, so it's best to try and match. I think nowadays buffers should be made a bit bigger than what "most" programmers tend to make them. I'd go with 32 KB rather than 4, for instance.
It's very hard to know in advance, unfortunately. It depends on whether your application is I/O or CPU bound, for instance.
I think that mostly it's just choosing a "round" number. If computers worked in decimal we'd probably choose 1000 or 10000 instead of 1024 or 8192. There is no very good reason.
One possible reason is that disk sectors are usually 512 bytes in size, so reading a multiple of that is more efficient, assuming that all the hardware layers and caching allow the low-level code to actually exploit this fact. Which it probably can't, unless you are writing a device driver or doing unbuffered reads.
No reason that I know of that it has to be a power of two. You are constrained by the buffer size having to be within max size_t but this is unlikely to be an issue.
Clearly the bigger the buffer the better, but this obviously doesn't scale indefinitely, so some account must be taken of system resources, either at compile time or, preferably, at runtime.
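Since the edit says a runtime answer is acceptable: on POSIX systems, fstat() reports the file system's preferred I/O block size in st_blksize, which is a reasonable starting point. A minimal sketch (the power-of-two rounding is just the usual habit, not a requirement):

#include <sys/stat.h>
#include <cstddef>

// Return a buffer size based on the file system's preferred block size,
// rounded up to a power of two, with a conservative fallback.
std::size_t preferred_buffer_size(int fd) {
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_blksize > 0) {
        std::size_t n = 1;
        while (n < static_cast<std::size_t>(st.st_blksize))
            n <<= 1;
        return n;
    }
    return 4096;
}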
1. Is there a reason it has to/should be a power of 2, or is this just programmers' natural inclination towards powers of 2?
Not really. It should probably be something that divides evenly into the width of the data bus to simplify memory copying, so anything that divides into 16 would be good with current technology. Using a power of 2 makes it likely that it will also work well with any future technology.
2. What would be the "ideal" number? By "ideal" I mean that it would be the fastest.
The fastest would be as much as possible. However, once you go over a few kilobytes you will have a very small performance difference compared to the amount of memory that you use.
I assume it would have to be a multiple of the underlying device's buffer size? Or maybe of the underlying stream object's buffer? How would I determine what the size of those buffers is, anyway?
You can't really know the size of the underlying buffers, or depend on that they remain the same.
And once I do, would using a multiple of it give any speed increase over just using the exact size?
Some, but very little.
I think the ideal buffer size is the size of one block on your hard drive, so it maps properly onto the drive's blocks when storing or fetching data.

How is a value of more than one byte interpreted?

A char may be one byte in size, but when it comes to a four-byte value, e.g. an int, how does the CPU know it is a single integer rather than four one-byte characters?
The CPU executes the code that you wrote.
This code tells the CPU how to treat the few bytes at a certain memory, like "take the four bytes at address 0x87367, treat them as an integer and add one to the value".
See, it's you who decide how to treat the memory.
Are you asking a question about CPU design?
Each CPU machine instruction is encoded so that the CPU knows how many bits to operate on.
The C++ compiler knows to emit 8-bit instructions for char and 32-bit instructions for int.
In general the CPU by itself knows nothing about the interpretation of values stored at certain memory locations, it's the code that is run (generated, in this case, by the compiler) that it's supposed to know it and use the correct CPU instructions to manipulate such values.
To put it another way, the type of a variable is a language abstraction that tells the compiler what code to generate to manipulate the memory.
Types in some way do exist at machine code level: there are various instructions to work on various types - i.e. the way the raw values stored in memory are interpreted, but it's up to the code executed to use the correct instructions to treat the values stored in memory correctly.
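A small sketch of that point: the same four bytes can be read either as one 32-bit integer or as four characters; the code (via the type it uses) chooses the interpretation.

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    unsigned char bytes[4] = {0x41, 0x42, 0x43, 0x44};   // 'A', 'B', 'C', 'D'

    std::uint32_t as_int;
    std::memcpy(&as_int, bytes, sizeof as_int);          // reinterpret the same storage
    std::printf("as int:  0x%08X\n", static_cast<unsigned>(as_int));  // value depends on endianness

    for (unsigned char c : bytes)
        std::printf("as char: %c\n", c);                 // the same bytes, one at a time
}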
The compiler has a table called the "symbol table", so the compiler knows the type of every variable and how it should be treated.
This depends on the architecture. Most systems use IEEE 754 floating-point representation and two's complement for integer values, but it's up to the CPU in question. It knows how to turn those bytes into "values" appropriately.
On the CPU side, this mostly relates to two things: the registers and the instruction set (look at x86 for example).
The register is just a small chunk of memory that is closest to the CPU. Values are put there and used there for doing basic operations.
The instruction set includes a set of fixed names (like EAX, AX, etc.) for addressing the registers. Depending on the name, they can refer to shorter or longer parts of a register (e.g. 8 bits, 16, 32, 64, etc.). Corresponding to those registers, there are operations (like addition, multiplication, etc.) which act on register values of a certain size too. How the CPU actually executes the instructions or even stores the registers is not relevant (it's at the discretion of the CPU's manufacturer), and it's up to the programmer (or compiler) to use the instruction set correctly.
The CPU itself has no idea what it's doing (it's not "intelligent"); it just performs the operations as they are requested. The compiler is what keeps track of the types of the variables and makes sure that the instructions that are generated, and later executed by the program, correspond to what you have coded (that's called "compilation"). But once the program is compiled, the CPU doesn't "keep track" of types or sizes or anything like that (it would be too expensive to do so). Since compilers are pretty much guaranteed to produce consistent instructions, this is not an issue. Of course, if you wrote your own assembly and used mismatched registers and instructions, the CPU still wouldn't care; it would just make your program behave very strangely (and likely crash).
Internally, a CPU may be wired to fetch 32 bits for an integer, which translates into 4 8-bit octets (bytes). The CPU does not regard the fetch as 4 bytes, but rather 32 bits.
The CPU is also internally wired to fetch 8 bits for a character (byte). In many processor architectures, the CPU fetches 32 bits from memory and internally ignores the unused bits (keeping the lowest 8 bits). This simplifies the processor architecture by only requiring fetches of 32 bits.
In efficient platforms, the memory is also accessible in 32-bit quantities. The data flow from the memory to the processor is often called the databus. In this description it would be 32 bits wide.
Other processor architectures can fetch 8 bits for a character. This removes the need for the processor to ignore 3 bytes from a 32-bit fetch.
Some programmers view integers in widths of bytes rather than bits. Thus a 32-bit integer would be thought of as 4 bytes. This can create confusion, especially with byte ordering, a.k.a. endianness. Some processors put the most significant bits in the first byte (big endian), while others put the least significant bits in the first byte (little endian). This leads to problems when transferring binary data between platforms.
Knowing that a processor's integer can hold 4 bytes and that it fetches 4 bytes at a time, many programmers like to pack 4 characters into an integer to improve performance. Thus the processor would require 1 fetch for 4 characters rather than 4 fetches for 4 characters. This performance improvement may be wasted by the execution time required to pack and unpack the characters from the integer.
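For what it's worth, the packing described above is typically done with shifts, which makes the resulting value independent of the machine's byte order (the first character always lands in the least significant byte). A quick sketch:

#include <cstdint>

// Pack four characters into one 32-bit value; `a` ends up in the least
// significant byte regardless of the machine's endianness.
std::uint32_t pack(unsigned char a, unsigned char b, unsigned char c, unsigned char d) {
    return std::uint32_t(a) | (std::uint32_t(b) << 8) |
           (std::uint32_t(c) << 16) | (std::uint32_t(d) << 24);
}

unsigned char unpack(std::uint32_t word, unsigned index) {
    return static_cast<unsigned char>(word >> (index * 8));   // index 0..3
}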
In summary, I highly suggest you forget about how many bytes make up an integer and any relationship to a quantity of characters or bytes. This concept is only relevant on a few embedded platforms or in a few high-performance applications. Your objective is to deliver correct and robust code within a given timeframe. Performance and size concerns come at the end of a project and are only tended to if somebody complains. You will do fine in your career if you concentrate on the range and limitations of an integer rather than on how much memory it occupies.