Optimize memory and performance for transmit data buffer - c++

I have a function which prepares a data buffer and then sends it via an externally provided API function which looks like:
send(uint8_t* data_buf, uint32_t length)
In my particular case, my code is always sending exactly 8 bytes and the first 7 bytes are always the same (I can't change this fact; it's some sort of message header).
Because I am in a limited, embedded environment, I would like to optimize the size and performance of my code, or at least choose the best tradeoff of the two.
Currently, I see two options:
1. Create a global array. Initialize the first 7 bytes one time and then just overwrite the last byte before sending the array.
2. Create a local array, write all 8 bytes and then send it.
Are there any better solutions than the two mentioned above?

Even on an embedded system, you can count on having more than 8 bytes of cache. Hence performance really doesn't matter. databuf[8] will be fully in cache in both cases. The code size for the first case might be smaller by 2-3 instructions.
(There's often another issue: when code can execute in place (XIP) from flash, code/flash size and RAM size are distinct constraints.)
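The two options from the question can be sketched side by side. This is only an illustration: the `send` below is a stub standing in for the externally provided API, and the header bytes 0xA0..0xA6 are made-up values, not the real message header.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stub standing in for the externally provided API (assumption). It records
// the last buffer so the behaviour can be checked without real hardware.
static std::array<std::uint8_t, 8> g_last_sent{};
static std::uint32_t g_last_length = 0;

void send(std::uint8_t* data_buf, std::uint32_t length) {
    assert(length <= g_last_sent.size());
    g_last_length = length;
    for (std::uint32_t i = 0; i < length; ++i) {
        g_last_sent[i] = data_buf[i];
    }
}

// Option 1: static buffer. The 7 header bytes (made-up values) are
// initialized once; only the last byte is rewritten per call. This keeps
// 8 bytes of RAM allocated permanently and is not reentrant.
void send_with_static_buffer(std::uint8_t payload) {
    static std::uint8_t buf[8] = {0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0x00};
    buf[7] = payload;
    send(buf, sizeof buf);
}

// Option 2: local buffer. All 8 bytes are written on every call, but the
// storage lives on the stack only for the duration of the call.
void send_with_local_buffer(std::uint8_t payload) {
    std::uint8_t buf[8] = {0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, payload};
    send(buf, sizeof buf);
}
```

Both functions hand the same 8 bytes to `send`; which one produces smaller or faster code depends on the compiler and target, so comparing the generated code is the only reliable way to decide.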


How does std::alignas optimize the performance of a program?

On a 32-bit machine, one memory read cycle fetches 4 bytes of data.
So reading the 128-byte buffer below should take 32 read cycles.
char buffer[128];
Now suppose I align this buffer as shown below. How would that make it faster to read?
alignas(128) char buffer[128];
I am assuming the memory read cycle will remain 4 bytes only.
The size of the registers used for memory access is only one part of the story, the other part is the size of the cache-line.
If a cache-line is 64 bytes and your char[128] has only its natural alignment (for char that means it may start at any byte), the CPU generally needs to manipulate three different cache-lines. With alignas(64) or alignas(128), only two cache-lines need to be touched.
If you are working with memory mapped file, or under swapping conditions, the next level of alignment kicks in: the size of a memory page. This would call for 4096 or 8192 byte alignments.
However, I seriously doubt that alignas() has any significant positive effect if the specified alignment is larger than the natural alignment that the compiler uses anyway: It significantly increases memory consumption, which may be enough to trigger more cache-lines/memory pages being touched in the first place. It's only the small misalignments that need to be avoided because they may trigger huge slowdowns on some CPUs, or might be downright illegal/impossible on others.
Thus, truth is only in measurement: If you need all the speedup you can get, try it, measure the runtime difference, and see whether it works out.
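The cache-line count mentioned above can be checked with a little address arithmetic. The following is a minimal sketch; the helper name `cache_lines_touched` and the default line size of 64 bytes are assumptions made for the example, not properties of any particular CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of cache lines overlapped by the byte range [addr, addr + size).
// line_size is an assumption (64 bytes is common) and must be nonzero.
std::size_t cache_lines_touched(std::uintptr_t addr, std::size_t size,
                                std::size_t line_size = 64) {
    assert(line_size != 0);
    if (size == 0) {
        return 0;
    }
    const std::uintptr_t first = addr / line_size;              // line of first byte
    const std::uintptr_t last = (addr + size - 1) / line_size;  // line of last byte
    return static_cast<std::size_t>(last - first + 1);
}

// A buffer forced onto a 64-byte boundary.
alignas(64) char aligned_buffer[128];

std::size_t lines_for_aligned_buffer() {
    return cache_lines_touched(
        reinterpret_cast<std::uintptr_t>(aligned_buffer), sizeof aligned_buffer);
}
```

With 64-byte lines, a 128-byte range that starts at address 1 overlaps three lines, while one that starts on a multiple of 64 overlaps two.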
In 32 bit machine, One memory read cycle gets 4 bytes of data.
It's not that simple. Just the term "32 bit machine" is already too broad and can mean many things. 32b registers (GP registers? ALU registers? Address registers?)? 32b address bus? 32b data bus? 32b instruction word size?
And "memory read" by whom. CPU? Cache? DMA chip?
If you have a HW platform where memory is read 4 bytes at a time (aligned to 4) in a single cycle and without any cache, then alignas(128) will make no difference compared with alignas(4).

Data alignment when socket recv() data is then written to a file using overlapped I/O with FILE_FLAG_NO_BUFFERING

I'm writing a C++ program that simply receives data from another computer and writes the data into an SSD RAID with high throughput (about 100MB/s since GbEthernet).
I have set up two overlapped I/O operations: one receives from Ethernet and the other writes to the SSD.
When a receive is done, it posts a message to the writer.
And I use FILE_FLAG_NO_BUFFERING when creating the file on disk.
On the side of network sender, I am using an overlapped IO to send data.
I am stuck on one problem: the byte count returned by rv = recv() is not aligned to what the disk requires (maybe a multiple of 4096?).
What should I do?
recv and unbuffered writes are not really very compatible with each other. It is possible to get that working, but it will take a little extra work.
When doing unbuffered writes, both the start address of your buffer and the amount to write must be multiples of the sector size (see MSDN). Aligning the buffer is trivial, but dealing with the fact that recv can return pretty much every amount of data (up to the amount you ask for, but in theory it could be just 1 byte) is a bit of work.
Another problem is that while it is pretty much guaranteed that the sector size is a power of two (though at least there used to exist harddisks with non-power-of-two sectors in the 1990s, this fact was hidden by the controller) you do not know what it is. And even if you did know, it might be different on the next computer. It might be 512 or 1024 or something else.
How to handle this? Most programmers resort to simply using a function that allocates complete memory pages, such as VirtualAlloc, or an anonymous memory mapping. Since these operate on pages, they are necessarily page-size aligned, which (usually) means 4096 bytes [1].
Since the amount of data to write must, too, be a multiple of the sector size (but the amount of data received probably isn't), you have to round down, do a partial write, and save the rest for the next write.
Again, the problem is that you don't know the sector size, so the best thing you can do is round down to the same granularity that you're using for the buffer start (anything else would be nonsensical). In other words, you conceptually have to do something like this:
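while (rv < 0x10000)                     // don't have a full 64 kB block yet
    receive_more_and_append();           // conceptually appends to buf and grows rv
num_write = rv & ~0xffff;                // round down to a multiple of 64 kB
rv -= num_write;                         // bytes left over after the write
memcpy(other_buf, buf + num_write, rv);  // save the remainder for the next write
WriteFileEx(...);                        // write num_write bytes starting at buf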
[1] That is only half the truth, since Windows has a minimum allocation granularity of 64kB. You can't allocate something smaller than 64k and it can't be aligned less than 64k. So in fact, you are good for sectors up to 64k, which is bigger than anything you are likely to ever encounter, realistically.
Also, as a small nitpick, Itanium has 8k pages, not 4k -- but that is no problem, it's actually better.
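The round-down-and-carry scheme described above can be sketched without any Windows calls. The class below is hypothetical: `BlockAccumulator` and its `Sink` callback are names invented for this example, and the callback stands in for the real unbuffered write. The `std::vector` storage is not sector-aligned, so a real implementation would keep the pending bytes in a page-aligned buffer (for example one obtained from VirtualAlloc) instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Collects arbitrarily sized chunks (as recv might deliver them) and hands
// only whole multiples of block_size to the sink, carrying the rest over.
class BlockAccumulator {
public:
    using Sink = std::function<void(const std::uint8_t*, std::size_t)>;

    BlockAccumulator(std::size_t block_size, Sink sink)
        : block_size_(block_size), sink_(std::move(sink)) {
        // The mask trick below only works for powers of two.
        assert(block_size_ != 0 && (block_size_ & (block_size_ - 1)) == 0);
    }

    void append(const std::uint8_t* data, std::size_t n) {
        pending_.insert(pending_.end(), data, data + n);
        // Round the buffered amount down to a multiple of block_size.
        const std::size_t num_write = pending_.size() & ~(block_size_ - 1);
        if (num_write == 0) {
            return;  // not enough for a full block yet
        }
        sink_(pending_.data(), num_write);
        // Keep the remainder for the next write.
        pending_.erase(pending_.begin(),
                       pending_.begin() + static_cast<std::ptrdiff_t>(num_write));
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t block_size_;
    Sink sink_;
    std::vector<std::uint8_t> pending_;
};
```

At the end of the stream a remainder smaller than one block is still pending; how to flush it (padding, or reopening the file without the no-buffering flag) is left out of this sketch.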

Variable Length Array Performance Implications (C/C++)

I'm writing a fairly straightforward function that sends an array over to a file descriptor. However, in order to send the data, I need to append a one byte header.
Here is a simplified version of what I'm doing and it seems to work:
void SendData(uint8_t* buffer, size_t length) {
    uint8_t buffer_to_send[length + 1];
    buffer_to_send[0] = MY_SPECIAL_BYTE;
    memcpy(buffer_to_send + 1, buffer, length);
    // more code to send the buffer_to_send goes here...
}
Like I said, the code seems to work fine, however, I've recently gotten into the habit of using the Google C++ style guide since my current project has no set style guide for it (I'm actually the only software engineer on my project and I wanted to use something that's used in industry). I ran Google's cpplint.py and it caught the line where I am creating buffer_to_send and threw some comment about not using variable length arrays. Specifically, here's what Google's C++ style guide has to say about variable length arrays...
http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Variable-Length_Arrays_and_alloca__
Based on their comments, it appears I may have found the root cause of seemingly random crashes in my code (which occur very infrequently, but are nonetheless annoying). However, I'm a bit torn as to how to fix it.
Here are my proposed solutions:
1. Make buffer_to_send essentially a fixed length array of a constant length. The problem that I can think of here is that I have to make the buffer as big as the theoretically largest buffer I'd want to send. In the average case, the buffers are much smaller, and I'd be wasting about 0.5KB doing so each time the function is called. Note that the program must run on an embedded system, and while I'm not necessarily counting each byte, I'd like to use as little memory as possible.
2. Use new and delete or malloc/free to dynamically allocate the buffer. The issue here is that the function is called frequently and there would be some overhead in terms of constantly asking the OS for memory and then releasing it.
3. Use two successive calls to write() in order to pass the data to the file descriptor. That is, the first write would pass only the one byte, and the next would send the rest of the buffer. While seemingly straightforward, I would need to research the code a bit more (note that I got this code handed down from a previous engineer who has since left the company I work for) in order to guarantee that the two successive writes occur atomically. Also, if this requires locking, then it essentially becomes more complex and has more performance impact than case #2.
Note that I cannot make the buffer_to_send a member variable or scope it outside the function since there are (potentially) multiple calls to the function at any given time from various threads.
Please let me know your opinion and what my preferred approach should be. Thanks for your time.
You can fold the two successive calls to write() in your option 3 into a single call using writev().
http://pubs.opengroup.org/onlinepubs/009696799/functions/writev.html
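A minimal sketch of the writev() suggestion, assuming a POSIX system. The function name `SendDataGathered` and the header value 0x7E are placeholders invented for the example (the question's MY_SPECIAL_BYTE is not shown in the source).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/uio.h>  // writev, struct iovec (POSIX)
#include <unistd.h>   // ssize_t

// Placeholder standing in for MY_SPECIAL_BYTE (assumption).
constexpr std::uint8_t kHeaderByte = 0x7E;

// Sends the one-byte header and the payload with a single writev() call,
// without copying the payload into a temporary buffer.
// Returns the number of bytes written, or -1 on error (errno is set).
ssize_t SendDataGathered(int fd, const std::uint8_t* buffer, std::size_t length) {
    assert(buffer != nullptr || length == 0);
    std::uint8_t header = kHeaderByte;
    iovec parts[2];
    parts[0].iov_base = &header;
    parts[0].iov_len = 1;
    // writev only reads from the buffers; iov_base is non-const for readv's sake.
    parts[1].iov_base = const_cast<std::uint8_t*>(buffer);
    parts[1].iov_len = length;
    return writev(fd, parts, 2);
}
```

Like write(), writev() may write fewer bytes than requested, so the caller should compare the return value against length + 1 and handle a short write.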
I would choose option 1. If you know the maximum length of your data, then allocate that much space (plus one byte) on the stack using a fixed size array. This is no worse than the variable length array you have shown because you must always have enough space left on the stack otherwise you simply won't be able to handle your maximum length (at worst, your code would randomly crash on larger buffer sizes). At the time this function is called, nothing else will be using the further space on your stack so it will be safe to allocate a fixed size array.
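Option 1 might look roughly like the following. The 512-byte bound is an assumption based on the "about 0.5KB" figure in the question, and `kSpecialByte`, `WriteFn` and `SendDataFixed` are hypothetical names; the callback stands in for whatever code actually sends the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kMaxPayload = 512;     // assumed upper bound on length
constexpr std::uint8_t kSpecialByte = 0x7E;  // placeholder for MY_SPECIAL_BYTE

// Hypothetical transport hook standing in for the real sending code.
using WriteFn = bool (*)(const std::uint8_t* data, std::size_t length);

bool SendDataFixed(const std::uint8_t* buffer, std::size_t length, WriteFn write_fn) {
    assert(write_fn != nullptr);
    if (length > kMaxPayload) {
        return false;  // reject oversized input instead of overrunning the array
    }
    // Fixed-size array on the stack: each call (and each thread) gets its own.
    std::uint8_t buffer_to_send[kMaxPayload + 1];
    buffer_to_send[0] = kSpecialByte;
    if (length != 0) {
        std::memcpy(buffer_to_send + 1, buffer, length);
    }
    return write_fn(buffer_to_send, length + 1);
}
```

Because the array is a local variable, concurrent calls from several threads do not share it, which matches the constraint stated in the question.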

C++ vector out of memory

I have a very large vector (millions of entries, 1024 bytes each). I am exceeding the maximum size of the vector (getting a bad memory alloc exception). I am doing a recursive operation over the vector of items which requires accessing other elements in the vector. The operations need to be done quickly. I am trying to avoid writing to disk for speed reasons. Is there any other way to store this data that would not require writing to disk? If I have to write the data to disk, what would be the best way to do it?
edit for a few more details.
The operation that I am performing on the data set generates a string recursively based on other data points in the vector. The data is sorted when it is read in. Data sets range from 50,000 to 50,000,000 entries.
The easiest way to solve this problem is to use STXXL. It's a reimplementation of the STL for large structures that transparently writes to disk when the data won't fit in memory.
Your problem cannot be solved as stated and clarified in the comments.
You have requested a way to have a contiguous in-memory buffer of 50,000,000 entries of size 1024 on a 32 bit system.
A 32 bit system has only 4294967296 bytes of addressable memory. You are asking for 51200000000 bytes of addressable memory, or 11.9 times the amount of memory address space on your system.
If you don't require that your data be contiguous and memory-addressable, if you don't require that your data all be in memory at once, or if you relax other requirements, there may be an answer to your problem. For example, some OSs expose access to a non-memory space of values that corresponds to RAM through some hacky interface or other (there were ways on Windows systems with 8 GB of RAM to use more than 4 GB in total).
But as stated, the answer is "no, you cannot do that".
Because your data must be contiguous, and you know how many elements you need to store, just create a std::vector and use the reserve() function to attempt to gain a contiguous block of memory of the required size.
There is very little overhead in storing a vector (just a few pointers to manage the beginning and end). This is as good as you'll be able to do.
If that fails:
add more memory to your machine (may not actually help, if you've run up against addressing or implementation constraints)
switch to a raw array
find a way to reduce the size of your elements
try to find a solution that can tackle the problem in small blocks
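The reserve() attempt can be wrapped so that a failure is reported instead of propagating as an exception. This is a small sketch; `Entry` and `try_reserve` are names made up for the example, with `Entry` sized to match the 1024-byte records in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <stdexcept>
#include <vector>

// Stand-in for one 1024-byte record (assumption about the element layout).
struct Entry {
    unsigned char bytes[1024];
};

// Tries to obtain one contiguous block for `count` entries.
// Returns false if the request exceeds max_size() or the allocation fails.
bool try_reserve(std::vector<Entry>& v, std::size_t count) {
    try {
        v.reserve(count);
    } catch (const std::length_error&) {
        return false;  // count > v.max_size()
    } catch (const std::bad_alloc&) {
        return false;  // allocator could not provide the memory
    }
    assert(v.capacity() >= count);
    return true;
}
```

If `try_reserve` returns false for the full data set, the fallbacks listed above apply, for example processing the data in blocks that each fit.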
For one million entries that is about 1 GB of memory (1024 bytes * 10^6 = 1 KB * 10^6, roughly 1 GB). In principle a 32-bit machine can address up to 4 GB of memory.
To answer your question, first try a plain malloc() call that allocates 1 GB of memory. This should succeed without any error.
Also, please paste the exact error msg that you get while using the vector.

Can __attribute__((packed)) affect the performance of a program?

I have a structure called log that has 13 chars in it. After doing a sizeof(log) I see that the size is not 13 but 16. I can use __attribute__((packed)) to get it to the actual size of 13, but I wonder if this will affect the performance of the program. It is a structure that is used quite frequently.
I would like to be able to read the size of the structure (13 not 16). I could use a macro, but if this structure is ever changed, i.e. fields are added or removed, I would like the new size to be updated without changing a macro, because I think this is error prone. Do you have any suggestions?
Yes, it will affect the performance of the program. Adding the padding means the compiler can use integer load instructions to read things from memory. Without the padding, the compiler must load things separately and do bit shifting to get the entire value. (Even if it's x86 and this is done by the hardware, it still has to be done).
Consider this: Why would compilers insert random, unused space if it was not for performance reasons?
Don't use __attribute__((packed)). If your data structure is in-memory, allow it to occupy its natural size as determined by the compiler. If it's for reading/writing to/from disk, write serialization and deserialization functions; do not simply store cpu-native binary structures on disk. "Packed" structures really have no legitimate uses (or very few; see the comments on this answer for possible disagreeing viewpoints).
Yes, it can affect the performance. In this case, if you allocate an array of such structures with the ((packed)) attribute, most of them must end up unaligned (whereas if you use the default packing, they can all be aligned on 16 byte boundaries). Copying such structures around can be faster if they are aligned.
Yes, it can affect performance. How depends on what it is and how you use it.
An unaligned variable can possibly straddle two cache lines. For example, if you have 64-byte cache lines, and you read a 4-byte variable from an array of 13-byte structures, there is a 3 in 64 (4.7%) chance that it will be spread across two lines. The penalty of an extra cache access is pretty small. If everything your program did was pound on that one variable, 4.7% would be the upper bound of the performance hit. If logging represents 20% of the program's workload, and reading/writing to that structure is 50% of logging, then you're already at a small fraction of a percent.
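The 3-in-64 figure can be reproduced by brute force. The sketch below assumes the array starts on a cache-line boundary and that lines are 64 bytes; `count_straddling` is a name invented for this example.

```cpp
#include <cassert>
#include <cstddef>

// Counts how many of the first `count` array elements have a field that
// crosses a cache-line boundary. Elements are `stride` bytes apart, the
// field sits `field_offset` bytes into each element, and the array is
// assumed to start on a cache-line boundary.
std::size_t count_straddling(std::size_t stride, std::size_t field_offset,
                             std::size_t field_size, std::size_t count,
                             std::size_t line_size = 64) {
    assert(line_size != 0 && field_size != 0);
    std::size_t straddling = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t first = i * stride + field_offset;  // first byte of field
        const std::size_t last = first + field_size - 1;      // last byte of field
        if (first / line_size != last / line_size) {
            ++straddling;
        }
    }
    return straddling;
}
```

Over 64 consecutive 13-byte records, a 4-byte field starts at every possible offset within a line exactly once (13 and 64 share no common factor), and only the offsets 61, 62 and 63 cross a boundary. With a 16-byte stride and a 4-byte-aligned field offset, no element straddles.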
On the other hand, presuming that the log needs to be saved, shrinking each record by 3 bytes is saving you 19%, which translates to a lot of memory or disk space. Main memory and especially the disk are slow, so you will probably be better off packing the log to reduce its size.
As for reading the size of the structure without worrying about the structure changing, use sizeof. However you like to do numerical constants, be it const int, enum, or #define, just add sizeof.
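As an illustration of the sizeof advice, here is a sketch with a made-up 13-byte record. The field names are assumptions (the question does not show the real layout), and `__attribute__((packed))` is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical log record with 13 bytes of payload. With default layout the
// compiler typically pads it to 16 bytes so the 32-bit fields stay aligned.
struct LogPadded {
    std::uint32_t timestamp;
    std::uint32_t code;
    std::uint32_t value;
    std::uint8_t level;
};

// Same fields without padding (GCC/Clang extension).
struct __attribute__((packed)) LogPacked {
    std::uint32_t timestamp;
    std::uint32_t code;
    std::uint32_t value;
    std::uint8_t level;
};

// Constants derived from sizeof follow the struct definition automatically,
// so adding or removing a field needs no manual update.
constexpr std::size_t kLogPaddedSize = sizeof(LogPadded);
constexpr std::size_t kLogPackedSize = sizeof(LogPacked);

static_assert(kLogPackedSize == 13, "packed layout has no padding");
```

The padded size is ABI-dependent, which is why only the packed size is checked at compile time here.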
As with all other performance optimizations, you'll need to profile your code to find the right answer. The right answer will vary by architecture and by how you use your structure.
If you're creating gigantic arrays the space savings from packing might mean the difference between fitting and not fitting in cache. Or your data might already fit into your cache, in which case it will make no difference. If you're allocating large numbers of the structures in an STL associative container that allocates the storage for your struct with operator new it might not matter at all --- operator new might round your storage up to something that's aligned anyway.
If most of your structures live on the stack the extra storage might already be optimized away anyway.
For a change this simple to test, I suggest building a timing rig and then trying things both ways. For further optimizations I suggest using a profiler to identify your bottlenecks and go from there.
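A timing rig for this kind of comparison can be very small. The helper below is a sketch using std::chrono; the name `average_ns_per_call` is made up for the example.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Runs `f` `iterations` times and returns the average wall-clock time per
// call in nanoseconds, measured with a monotonic clock.
template <typename F>
double average_ns_per_call(F&& f, std::size_t iterations) {
    assert(iterations != 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        f();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

To compare the packed and padded variants, pass a lambda that exercises each one over the same data and run enough iterations to swamp clock resolution. The work inside the lambda needs an observable side effect (for example writing to a volatile variable), otherwise the optimizer may remove it and the measurement becomes meaningless.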