Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
I recently posted a question about unaligned memory access, but given the answer, I am a little lost. I often hear that "aligned memory access is far more efficient than unaligned access", but I am actually not sure what unaligned memory is. Consequently:
What is unaligned memory?
How do you declare something unaligned in C++? (small example program)
How do you access and manipulate something unaligned in C++? (small example program)
Is there a way to manipulate unaligned memory with well-defined behavior, or is all of that platform dependent/undefined behavior in C++?
Whether something is unaligned or not depends on the data type and its size, as the answer from Gregg explains.
A well-written program usually does not have unaligned memory access, except when the compiler introduces it. (Yes, that happens during vectorization, but let's skip that.)
But you can write a program in C++ that forces unaligned memory access. The code below does just that.
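#include <iostream>
using namespace std;
int main() {
    int a[3] {1, 2, 3};
    // Reinterpret the addresses of a[0] and a[1] as long long pointers and read through them.
    cout << *((long long *)(&a[0])) << endl;
    cout << *((long long *)(&a[1])) << endl;
    // Print the two addresses as integers; they differ by 4, so at most one is a multiple of 8.
    cout << (long long) (&a[0]) << endl;
    cout << (long long) (&a[1]) << endl;
    return 0;
}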
The output of the code is this
8589934593
12884901890
70367819479584
70367819479588
What does this program do?
I declare an integer array of size 3. This array will be 4-byte aligned because int is a 4-byte data type (at least on my platform), so the address of a[0] is divisible by 4. The addresses of both a[0] and a[1] are divisible by 4, but only one of them is divisible by 8.
So if I cast the addresses of a[0] and a[1] to pointers to long long (which is an 8-byte data type on my platform) and then dereference these two pointers, one of them will be an unaligned memory access. Strictly speaking, this is undefined behavior in standard C++: it reads int objects through a long long lvalue, and one of the pointers does not meet the alignment requirement of long long. On platforms such as x86-64 it typically works anyway, but the unaligned access may be slower than an aligned one.
As you can see, this code contains C-style casts, which are not good practice. But I think that is fine for forcing some strange behavior.
Let me know if you have questions about the output of the code. You should know about endianness and the representation of integers to understand the first two lines. The third and fourth lines are the addresses of the first two elements of the integer array. That should be easier to understand.
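On the "defined behavior" part of the question: a common approach is to check an address against an alignment and to copy bytes with std::memcpy instead of dereferencing a cast pointer. The sketch below is only an illustration; is_aligned and load_long_long are hypothetical helper names, and the address check assumes that converting a pointer to std::uintptr_t yields a plain byte address, which holds on mainstream platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if the address p is a multiple of `alignment` bytes.
inline bool is_aligned(const void *p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: read a long long from any address by copying its bytes.
// No long long pointer is ever dereferenced, so no alignment requirement applies.
inline long long load_long_long(const void *p) {
    assert(p != nullptr);
    long long value;
    std::memcpy(&value, p, sizeof value);
    return value;
}
```

With the array from the program above, load_long_long(&a[1]) would read the same bytes as the second cast, provided long long is no larger than two ints on the platform. Compilers usually turn the fixed-size memcpy into a single load where the hardware allows it.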
Taking the example of a 32-bit computer reading a 4-byte word of data:
In the hardware, a 32-bit computer reads 4 bytes at a time, but only starting at every 4th byte. This is because the memory bus is 4 bytes wide.
If your 4-byte data does not start on one of those 4-byte boundaries, the computer must read the memory twice and then assemble the 4 bytes into a single register internally.
Based on the architecture chosen, the compiler knows this and places/pads data structures so that 2-byte data starts on 2-byte boundaries, 4-byte data starts on 4-byte boundaries, etc. This is specifically to avoid misaligned reads.
You can get misaligned reads if you read data in as bytes (for example, from a serial protocol) and then access them as 32-bit words. Avoid this in speed-critical code. Usually it is taken care of for you and is not a problem.
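For the serial-protocol case, one way to sidestep the misaligned word access entirely is to assemble the value from individual bytes. This is a minimal sketch; read_u32_be is a hypothetical helper, and the big-endian byte order is an assumption about the protocol, not something stated above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: build a 32-bit value from four bytes at any offset,
// assuming the protocol sends the most significant byte first.
inline std::uint32_t read_u32_be(const unsigned char *bytes) {
    assert(bytes != nullptr);
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8) |
           static_cast<std::uint32_t>(bytes[3]);
}
```

Only byte-sized reads are performed, so the offset of the field inside the received buffer does not matter.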
I use memcpy() to write data to a device. With a logic analyzer/PCIe analyzer, I can see the actual stores.
My device gets more stores than expected.
For example,
auto *data = new uint8_t[1024]();
for (int i = 0; i < 50; i++) {
    memcpy((void *)(addr), data, i);
}
For i=9, I see these stores:
4B from byte 0 to 3
4B from byte 4 to 7
3B from byte 5 to 7 (1B-aligned only, re-writing the same data -> inefficient and useless store)
1B the byte 8
In the end, all the 9 Bytes are written but memcpy creates an extra store of 3B re-writing what it has already written and nothing more.
Is it the expected behavior? The question is for C and C++, I'm interested in knowing why this happens, it seems very inefficient.
Is it the expected behavior?
The expected behavior is that it can do anything it feels like (including writing past the end, especially in a "read 8 bytes into a register, modify the first byte in the register, then write 8 bytes" way) as long as the result works as if the rules for the C abstract machine were followed.
Using a logic analyzer/PCIe analyzer to see the actual stores is so far beyond the scope of "works as if the rules for the abstract machine were followed" that it's unreasonable to have any expectations.
Specifically, you can't assume the writes will happen in any specific order, can't assume anything about the size of any individual write, can't assume writes won't overlap, can't assume there won't be writes past the end of the area, can't assume writes will actually occur at all (without volatile), and can't even assume that CHAR_BIT isn't larger than 8 (or that memcpy(dest, source, 10); isn't asking to write 20 octets/"8 bit bytes").
If you need guarantees about writes, then you need to enforce those guarantees yourself (e.g. maybe create a structure of volatile fields to force the compiler to ensure writes happen in a specific order, maybe use inline assembly with explicit fences/barriers, etc).
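As a rough sketch of the volatile idea: the loop below performs one volatile byte store per source byte, in ascending order. store_bytes_volatile is a hypothetical helper name. Note the limits: volatile only constrains what the compiler emits, so whether the bus or a write-combining buffer merges or reorders these stores still depends on the hardware and on how the device memory is mapped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: copy n bytes to a device region with one volatile
// store per byte, so the compiler may not merge, drop, or reorder them.
inline void store_bytes_volatile(volatile std::uint8_t *dst,
                                 const std::uint8_t *src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];  // each assignment is a separate volatile access
    }
}
```

Byte stores are the most conservative choice; a device that expects 32-bit accesses would need the same pattern with a volatile std::uint32_t pointer.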
The following illustrates why memcpy may be implemented this way.
To copy 9 bytes, starting at a 4-byte aligned address, memcpy issues these instructions (described as pseudo code):
Load four bytes from source+0 and store four bytes to destination+0.
Load four bytes from source+4 and store four bytes to destination+4.
Load four bytes from source+5 and store four bytes to destination+5.
The processor implements the store instructions with these data transfers in the hardware:
Since destination+0 is aligned, store 4 bytes to destination+0.
Since destination+4 is aligned, store 4 bytes to destination+4.
Since destination+5 is not aligned, store 3 bytes to destination+5 and store 1 byte to destination+8.
This is an easy and efficient way to write memcpy:
If length is less than four bytes, jump to separate code for that.
Loop copying four bytes until fewer than four bytes are left.
If length is not a multiple of four, copy four bytes from source+length−4 to destination+length−4.
That single step to copy the last few bytes may be more efficient than branching to three different cases for one, two, or three remaining bytes.
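The three steps above can be sketched in C++ as follows. This is an illustration of the strategy, not the code of any real memcpy; overlapping_copy and copy4 are hypothetical names, and the source and destination buffers are assumed not to overlap each other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy exactly four bytes; a fixed-size memcpy usually becomes one load and one store.
inline void copy4(unsigned char *dst, const unsigned char *src) {
    std::uint32_t word;
    std::memcpy(&word, src, 4);
    std::memcpy(dst, &word, 4);
}

// Hypothetical illustration of the "overlapping tail" strategy described above.
inline void overlapping_copy(unsigned char *dst, const unsigned char *src,
                             std::size_t length) {
    assert(dst != nullptr && src != nullptr);
    if (length < 4) {  // separate code path for very short copies
        for (std::size_t i = 0; i < length; ++i) dst[i] = src[i];
        return;
    }
    std::size_t i = 0;
    for (; i + 4 <= length; i += 4) copy4(dst + i, src + i);  // whole words
    if (i != length) {
        // Tail: one more four-byte copy ending exactly at the last byte,
        // re-writing up to three bytes that were already copied.
        copy4(dst + length - 4, src + length - 4);
    }
}
```

For a length of 9 this issues copies at offsets 0, 4, and 5, which matches the stores seen on the analyzer.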
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I am not a C++ programmer, but this algorithm appeared in an operating manual for a machine I'm using and I'm struggling to make sense of it. I'd like someone to explain some terms within it, or possibly the flow of the entire code, given that I don't have time to learn C in the course of the project.
Waveform files for the machine in question are made up of a number of tags in curly brackets. The checksum is calculated using the WAVEFORM tag, {WAVEFORM-length: #data}.
The "data" consists of a number of bytes represented as hexadecimal numbers. "length" is the number of bytes in the "data", while "start" apparently points to the first byte in the "data".
I've managed to work out some of the terms, but I'm particularly unsure about my interpretation of ((UINT32 *)start)[i]
UINT32 checksum(void *start, UINT32 length)
{
    UINT32 i, result = 0xA50F74FF;
    for(i=0; i < length/4; i++)
        result = result ^ ((UINT32 *)start)[i];
    return(result);
}
So from what I can tell, the code does the following:
Take the address of the first byte in the "data" and the length of the "data"
Create a variable called result, which is an unsigned integer A50F74FF
For each byte in the first 25% of the data string, raise "result" to that power (presumably modulo 2^32)
Return result as the value checksum
Am I correct here or have I misread the algorithm on one of the steps? I feel like I can't be correct, because basing a checksum on only part of the data wouldn't spot errors in the later parts of the data.
For each byte in the first 25% of the data string, raise "result" to that power (presumably modulo 2^32)
This is wrong. ^ is the bitwise XOR operation. It does not raise to a power.
Also, about "of the data string": the algorithm iterates over the pointed-to data as if it were an array of UINT32. In fact, if start doesn't point to (an element of) an array of UINT32, then the behaviour of the program is undefined [1]. It would be much better to declare the argument as UINT32* in the first place and not use the explicit cast.
Also, about "For each byte in the first 25% of the data string": the algorithm appears to go through (nearly [2]) all bytes from start to start + length. length is presumably measured in bytes, and UINT32 is presumably a type that consists of 4 bytes. Thus an array of UINT32 objects occupying N bytes contains N/4 UINT32 elements. Note that this assumes that a byte is 8 bits wide, which is probably an assumption that the manual can make, but keep in mind that it is not portable to all systems.
[1] UB as far as the C++ language is concerned. But, if it's shown in the operating manual of a machine, then perhaps the special compiler for the particular hardware specifies defined behaviour for this. That said, it is also quite possible for the author of the manual to have made a mistake.
[2] If length is not divisible by 4, then the remaining 1-3 bytes are not used.
So the pseudocode for this function is roughly like this:
function checksum(DATA)
RESULT = 0xA50F74FF;
for each DWORD in DATA do
RESULT = RESULT xor DWORD
return RESULT
where DWORD is a four-byte integer value.
The function is actually going through (almost) all of the data (not 25%), but it does so in 4-byte increments; that's why the length, which is in bytes, is divided by 4.
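For reference, here is a sketch of the same calculation that avoids casting the data pointer, by copying each 4-byte group into a 32-bit variable. checksum_bytes is a hypothetical name, std::uint32_t stands in for the manual's UINT32, and the words are interpreted in the host's byte order, which is presumably what the original code does on the machine in question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical re-statement of the manual's checksum: XOR all complete
// 32-bit words of the data into a fixed starting value.
inline std::uint32_t checksum_bytes(const void *start, std::size_t length) {
    assert(start != nullptr || length == 0);
    const unsigned char *bytes = static_cast<const unsigned char *>(start);
    std::uint32_t result = 0xA50F74FFu;
    for (std::size_t i = 0; i < length / 4; ++i) {
        std::uint32_t word;
        std::memcpy(&word, bytes + 4 * i, sizeof word);  // bytes 4i .. 4i+3
        result ^= word;
    }
    return result;  // up to 3 trailing bytes are ignored, as in the original
}
```

XOR-ing a word of all zero bits leaves the result unchanged, and XOR-ing the same word twice cancels out, which is a quick way to sanity-check an implementation.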
After some reading, I understand that the compiler pads structs and classes so that each member can be accessed on its naturally aligned boundary. So under what circumstances is it necessary for programmers to align data explicitly to achieve better performance? My question arises from here:
Intel 64 and IA-32 Architectures Optimization Reference Manual:
For best performance, align data as follows:
Align 8-bit data at any address.
Align 16-bit data to be contained within an aligned 4-byte word.
Align 32-bit data so that its base address is a multiple of four.
Align 64-bit data so that its base address is a multiple of eight.
Align 80-bit data so that its base address is a multiple of sixteen.
Align 128-bit data so that its base address is a multiple of sixteen.
So suppose I have a struct:
struct A
{
    int a;
    int b;
    int c;
};
// size = 12;
// aligned on boundary of: 4
By creating an array of type A, even if I do nothing, it is properly aligned. Then what's the point to follow the guide and make the alignment stronger?
Is it because of cache line split? Assuming the cache line is 64 bytes. With the 6th access of object in the array, the byte starts from 61 to 72, which slows down the program??
BTW, is there a macro in standard library that tells me the alignment requirement based on the running machine by returning a value of std::size_t?
Let me answer your question directly: No, there is no need to explicitly align data in C++ for performance.
Any decent compiler will properly align the data for the underlying system.
The problem would come (variation on above) if you had:
struct B
{
    int w;
    char x;
    int y;
    char z;
};
This illustrates the two common structure alignment problems.
(1) A compiler would likely insert 3 padding bytes after x; without them, y would be unaligned. (2) It would likely insert 3 padding bytes after z as well; without them, w and y would be unaligned in later elements of an array.
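You can ask the compiler what it actually did with offsetof, sizeof, and alignof. The sketch below repeats the struct under the name B (the name is only there so it can be referred to); padding_bytes is a hypothetical helper. The exact numbers are platform-specific, so the checks are written to hold on any conforming layout.

```cpp
#include <cassert>
#include <cstddef>

struct B {
    int w;
    char x;
    int y;
    char z;
};

// y must start on an int boundary, which forces padding after x.
static_assert(offsetof(B, y) % alignof(int) == 0, "y is aligned");
// The size is a multiple of the alignment, which forces padding after z
// so that every element of an array of B stays aligned.
static_assert(sizeof(B) % alignof(B) == 0, "array elements stay aligned");

// Hypothetical helper: total padding = struct size minus the bytes of its members.
inline std::size_t padding_bytes() {
    const std::size_t members = 2 * sizeof(int) + 2 * sizeof(char);
    assert(sizeof(B) >= members);
    return sizeof(B) - members;
}
```

On a typical platform with 4-byte int, sizeof(B) is 16 and padding_bytes() returns 6, i.e. 3 bytes after x and 3 after z.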
The instructions you are reading in the manual are targeted at assembly language programmers and compiler writers.
When data is unaligned, on some systems (not Intel) it causes an exception, and on others it takes multiple processor cycles to fetch and write the data.
The only time I can think of when you want explicit alignment is when you are directly copying/casting data between your struct and a char* for serialization in some type of binary protocol.
Here unexpected padding may cause problems with a remote user of your protocol.
In pseudocode:
struct Data PACKED
{
    char code[3];
    int val;
};
Data data = { "AB", 24 };
char buf[20];
memcpy(buf, &data, sizeof(data));
send(buf, sizeof(data));
Now if our protocol expects 3 octets of code followed by a 4-octet integer value for val, we will run into problems if we use the above code with a normally padded struct, since the padding ends up in the buffer and shifts val. The only way to get this direct copy to work is for the struct above to be packed (alignment 1).
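If a packed struct is not available or not wanted, the same wire layout can be produced by writing the fields one at a time, so the in-memory padding never reaches the buffer. This is a sketch under assumptions: serialize is a hypothetical helper, val is taken to be a 32-bit integer, and the protocol is assumed to send it most significant octet first.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Data {  // ordinary struct; the compiler may pad it freely
    char code[3];
    std::int32_t val;
};

// Hypothetical wire format: 3 octets of code, then val as 4 octets, big-endian.
inline std::array<std::uint8_t, 7> serialize(const Data &d) {
    std::array<std::uint8_t, 7> out{};
    for (int i = 0; i < 3; ++i) {
        out[i] = static_cast<std::uint8_t>(d.code[i]);
    }
    const std::uint32_t v = static_cast<std::uint32_t>(d.val);
    out[3] = static_cast<std::uint8_t>(v >> 24);
    out[4] = static_cast<std::uint8_t>(v >> 16);
    out[5] = static_cast<std::uint8_t>(v >> 8);
    out[6] = static_cast<std::uint8_t>(v);
    return out;
}
```

This also pins down the byte order, which a packed struct copied with memcpy leaves to the host.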
There is indeed a facility in the language (it's not a macro, and it's not from the standard library) to tell you the alignment of an object or type. It's alignof (see also: std::alignment_of).
To answer your question: In general you should not be concerned with alignment. The compiler will take care of it for you, and in general/most cases it knows much, much better than you do how to align your data.
The only case where you'd need to fiddle with alignment (see alignas specifier) is when you're writing some code which allows some possibly less aligned data type to be the backing store for some possibly more aligned data type.
Examples of things that do this under the hood are std::experimental::optional and boost::variant. There are also facilities in the standard library explicitly for creating such a backing store, namely std::aligned_storage and std::aligned_union.
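To make the "backing store" case concrete, here is a minimal sketch of a byte buffer that is given the alignment of the type it will later hold. Wide, Slot, emplace, and reset are hypothetical names for illustration; real optional/variant implementations are considerably more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <new>

struct Wide {  // hypothetical payload type
    double value;
};

struct Slot {
    // unsigned char alone only needs alignment 1; alignas raises it to Wide's.
    alignas(Wide) unsigned char bytes[sizeof(Wide)];
    bool engaged = false;
};

// Construct a Wide inside the raw bytes with placement new.
inline Wide *emplace(Slot &slot, double v) {
    assert(!slot.engaged);
    Wide *w = new (slot.bytes) Wide{v};
    slot.engaged = true;
    return w;
}

// Destroy the object again; the bytes can then be reused.
inline void reset(Slot &slot, Wide *w) {
    assert(slot.engaged);
    w->~Wide();
    slot.engaged = false;
}
```

Without the alignas, the placement new could land on an address that is not suitably aligned for Wide.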
By creating an array of type A, even if I do nothing, it is properly aligned. Then what's the point to follow the guide and make the alignment stronger?
The ABI only describes how to use the data elements it defines. The guideline doesn't apply to your struct.
Is it because of cache line split? Assuming the cache line is 64 bytes. With the 6th access of object in the array, the byte starts from 61 to 72, which slows down the program??
The cache question could go either way. If your algorithm randomly accesses the array and touches all of a, b, and c then alignment of the entire structure to a 16-byte boundary would improve performance, because fetching any of a, b, or c from memory would always fetch the other two. However if only linear access is used or random accesses only touch one of the members, 16-byte alignment would waste cache capacity and memory bandwidth, decreasing performance.
Exhaustive analysis isn't really necessary. You can just try and see what alignas does for performance. (Or add a dummy member, pre-C++11.)
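As a starting point for such an experiment, the sketch below declares a 16-byte-aligned variant of the struct and counts how many array elements straddle a cache line boundary. AlignedA and count_line_splits are hypothetical names, and the 64-byte line size and the array starting on a line boundary are assumptions.

```cpp
#include <cassert>
#include <cstddef>

struct A {
    int a, b, c;
};

struct alignas(16) AlignedA {  // hypothetical over-aligned variant
    int a, b, c;
};

// Count how many of the first n elements of an array of T cross a cache
// line boundary, assuming the array itself starts on a line boundary.
template <typename T>
std::size_t count_line_splits(std::size_t n, std::size_t line = 64) {
    assert(line != 0);
    std::size_t splits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t first = (i * sizeof(T)) / line;
        const std::size_t last = (i * sizeof(T) + sizeof(T) - 1) / line;
        if (first != last) ++splits;
    }
    return splits;
}
```

With 4-byte int, 2 of the first 16 elements of an A array cross a line, while no AlignedA element does, at the cost of 4 extra bytes per element. Whether that trade pays off is exactly what a benchmark has to show.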
BTW, is there a macro in standard library that tells me the alignment requirement based on the running machine by returning a value of std::size_t?
C++11 (and C11) have an alignof operator.
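A short sketch of what using it looks like; the constant names are hypothetical. The result is a compile-time std::size_t for the target the code is compiled for, not something measured on the running machine.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

struct A {
    int a, b, c;
};

// alignof yields a std::size_t constant known at compile time.
constexpr std::size_t kAlignInt = alignof(int);
constexpr std::size_t kAlignA = alignof(A);
constexpr std::size_t kAlignMax = alignof(std::max_align_t);

// std::alignment_of is the type-trait spelling of the same query.
static_assert(kAlignA == std::alignment_of<A>::value, "same answer");
```

std::max_align_t gives the strictest alignment among the fundamental scalar types, which is also what plain malloc and new are required to satisfy.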
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I have heard that bit fields in a structure can be used to save some bytes of memory. How can we use these bytes for any purpose?
typedef struct
{
    char A : 1;
    int B : 1;
} Struct1;
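The char value itself with a width of one bit is not particularly useful. In fact, char bit fields are not guaranteed by the C standard; support for them is implementation-defined (Microsoft's compiler, among others, accepts them). On the other hand, the B field can be used as an on/off value, since a one-bit field can hold exactly two values. Whether a plain int bit field is signed is implementation-defined, so declare it unsigned int if the two values must be 0 and 1.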
Struct1 s;
s.B = 0;
if (s.B) {
    ...
}
This particular example doesn't demonstrate the savings offered by bit fields particularly well. We need a more complex struct for that. Consider the following:
typedef struct {
    int Value1;
    int Value2;
} S1;
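On most platforms S1 will have a size of 8 (each int field being 4 bytes long). Imagine though that Value1 and Value2 will always have values between 0 and 10. Each could be stored in 4 bits, but we're using 32 bits for each. Using bit fields, we could reduce the waste significantly here:
typedef struct {
    unsigned int Value1 : 4;
    unsigned int Value2 : 4;
} S1;
Now both values share a single storage unit, so S1 shrinks (typically to 4 bytes with int-typed fields, or to 1 byte on compilers that accept unsigned char bit fields) and can still hold all of the necessary values. The fields are declared unsigned because a signed 4-bit field only covers -8 to 7.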
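A small sketch to check the claim on your own compiler. Plain, Packed, and make_packed are hypothetical names; the fields are unsigned so that 4 bits cover 0 to 15. The resulting size is implementation-defined, which is why the example compares sizes instead of stating one.

```cpp
#include <cassert>
#include <cstddef>

struct Plain {
    int Value1;
    int Value2;
};

struct Packed {
    unsigned int Value1 : 4;  // holds 0..15
    unsigned int Value2 : 4;  // holds 0..15
};

// Hypothetical helper: build a Packed from two small values.
inline Packed make_packed(unsigned v1, unsigned v2) {
    assert(v1 <= 15 && v2 <= 15);  // larger values would not fit in 4 bits
    Packed p{};
    p.Value1 = v1;
    p.Value2 = v2;
    return p;
}
```

Printing sizeof(Plain) and sizeof(Packed) shows the saving for a given compiler; on common ones it is 8 versus 4.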
In embedded systems, the bit fields in a structure can be used to represent the bit fields of a hardware device.
Other uses for bit fields are in protocols (messages). Using one byte (or 4 bytes) to represent the presence or absence of each of many things would occupy a lot of space and waste transmission time. So in 1 byte you could represent 8 Boolean conditions rather than using 8 bytes or 8 words to do so.
The bit fields in a structure are usually used as a convenience. The same operations to extract, set, or test bit fields can be performed using the bitwise operators (such as AND).
Saving memory means that you need less memory. This can increase program performance due to less swapping or fewer cache misses.
In your example, the two parts A and B would be stored in one byte (or whatever the compiler decides to use) instead of two. A better example is storing the occupied seats in an opera house with 1000 seats. If you stored them as booleans, which often take one byte each, you would need 1000 bytes; with one bit per seat they fit in 125 bytes.
The downside is the performance loss. Accessing the bits needs some additional shifting or masking. Thus, it's a trade-off of memory for computation time.
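The opera house example can be sketched with explicit shifting and masking, which also shows the extra work each access costs. SeatMap and its member names are hypothetical; 1000 seats need (1000 + 7) / 8 = 125 bytes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSeats = 1000;

struct SeatMap {  // hypothetical type: one bit per seat
    std::array<std::uint8_t, (kSeats + 7) / 8> bits{};

    void occupy(std::size_t seat) {
        assert(seat < kSeats);
        bits[seat / 8] |= static_cast<std::uint8_t>(1u << (seat % 8));
    }
    void release(std::size_t seat) {
        assert(seat < kSeats);
        bits[seat / 8] &= static_cast<std::uint8_t>(~(1u << (seat % 8)));
    }
    bool occupied(std::size_t seat) const {
        assert(seat < kSeats);
        return ((bits[seat / 8] >> (seat % 8)) & 1u) != 0;
    }
};
```

Every access needs a division, a shift, and a mask, where an array of bool would need a single load or store. std::bitset<1000> packages the same idea in the standard library.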
What I understood about char type from a few questions asked here is that it is always 1 byte in C++, but number of bits can vary from system to system.
The sizeof operator uses char as its unit, so sizeof(char) is always 1 in C++ (a byte whose number of bits is that of the smallest addressable unit of the local machine). If, when using the file functions of fstream in binary mode, we read and write directly from/to the address of a variable in RAM, then the smallest unit of data written to the file should have the size of the unit read from RAM, and vice versa for a read from the file. Can we then say that data may not be written in groups of 8 bits if something like this is tried:
ofstream file;
file.open("blabla.bin",ios::out|ios::binary);
char a[]="asdfghjkkll";
file.seekp(0);
file.write((char*)a,sizeof(a)-1);
file.close();
Unless char always has the standard 8 bits, what happens if a heap of data is written to a file on a 16-bit machine and is read on a 32-bit machine? Or should I use OS-dependent text mode? If not, what have I misunderstood?
Edit : I have corrected my mistake.
Thanks for warning.
Edit2: My system is 64-bit, but I get the number of bits of the char type as 8. What is wrong? Is the way I get the result of 8 false?
I got a 00000... by shifting a char variable by more than its possible size with bitwise operators. After guaranteeing that all bits of the variable are zero, I got a 111... by inverting it, and shifted it until it became zero. If we shift it as many times as it has bits, we get zero, so we can get the number of bits from the index of the loop below when it terminates.
char zero,test;
zero<<=64; //hoping that system is not more than 64 bit(most likely)
test=~zero; //we have a 111...
int i;
for(i=0; test!=zero; i++)
test=test<<1;
The value of i after the loop is the number of bits in the char type. According to this, the result is 8.
My last question is:
Are the filesystem byte and the char type different data types, because the way the computer addresses positions in a file stream differs from the standard char type, which is at least 8 bits?
So, exactly what is going on the background?
Edit3: Why these minuses? What is my mistake? Isn't the question clear enough? Maybe my question is stupid but why there is no any response related to my question?
A language standard can't really specify what the filesystem does - it can only specify how the language interacts with it. The C and C++ standards also don't address anything to do with interoperability or communication between different implementations. In other words, there isn't a general answer to this question except to say that:
the VAST majority of systems use 8-bit bytes
the C and C++ standards require that char is at least 8 bits
it is very likely that greater-than-8-bit systems have mechanisms in place to somehow utilize (or at least transcode) 8-bit files.
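On the "64-bit system but 8-bit char" puzzle: "64-bit" describes the width of addresses and registers, not of char, and the number of bits in a char is available directly as CHAR_BIT. The sketch below also counts the bits by shifting, using an initialized unsigned char so that every step is well defined; count_char_bits is a hypothetical name.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Count the bits of unsigned char by shifting an all-ones value until it is zero.
inline int count_char_bits() {
    unsigned char test = static_cast<unsigned char>(~0u);  // all bits set
    int bits = 0;
    while (test != 0) {
        // The shift happens in a wider type; the cast drops the bits
        // that no longer fit into an unsigned char.
        test = static_cast<unsigned char>(test << 1);
        ++bits;
    }
    return bits;
}
```

The result always equals CHAR_BIT, which is 8 on practically every current desktop and server platform regardless of whether the CPU is 32-bit or 64-bit.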