Understanding the efficiency of an std::string - c++

I'm trying to learn a little bit more about C++ strings.
Consider:
const char* cstring = "hello";
std::string string(cstring);
and
std::string string("hello");
Am I correct in assuming that both store "hello" in the .data section of an application and the bytes are then copied to another area on the heap where the pointer managed by the std::string can access them?
How could I efficiently store a really, really long string? I'm kind of thinking about an application that reads in data from a socket stream. I fear concatenating many times. I could imagine using a linked list and traversing it.
Strings have intimidated me for far too long!
Any links, tips, explanations, further details, would be extremely helpful.

I have stored strings in the tens or hundreds of MB range without issue. Naturally, it will be primarily limited by your available (contiguous) memory / address space.
If you are going to be appending / concatenating, there are a few things that may help efficiency-wise: if possible, use the reserve() member function to pre-allocate space -- even a rough idea of the final size can save unnecessary re-allocations as the string grows.
Additionally, many string implementations use "exponential growth", meaning that they grow by some percentage rather than by a fixed byte size. For example, an implementation might simply double the capacity any time additional space is needed. By growing the capacity exponentially, lots of concatenations become much cheaper on average. (The exact details will depend on your standard library implementation.)
Finally, another option (if your library supports it) is to use the rope<> template: ropes are similar to strings, except that they are much more efficient when performing operations on very large strings. In particular, "ropes are allocated in small chunks, significantly reducing memory fragmentation problems introduced by large blocks". There are additional details in SGI's STL guide.

Since you're reading the string from a socket, you can reuse the same packet buffers and chain them together to represent the huge string. This will avoid any needless copying and is probably the most efficient solution possible. I seem to remember that the ACE library provides such a mechanism. I'll try to find it.
EDIT: ACE has ACE_Message_Block that allows you to store large messages in a linked-list fashion. You almost need to read the C++ Network Programming books to make sense of this colossal library. The free tutorials on the ACE website really suck.
I bet Boost.Asio must be capable of doing the same thing as ACE's message blocks. Boost.Asio now seems to have a larger mindshare than ACE, so I suggest looking for a solution within Boost.Asio first. If anyone can enlighten us about a Boost.Asio solution, that would be great!
It's about time I try writing a simple client-server app using Boost.Asio to see what all the fuss is about.

I don't think efficiency should be the issue. Both will perform well enough.
The deciding factor here is encapsulation. std::string is a far better abstraction than char * could ever be. Encapsulating pointer arithmetic is a good thing.
A lot of people thought long and hard to come up with std::string. I think failing to use it for unfounded efficiency reasons is foolish. Stick to the better abstraction and encapsulation.

As you probably know, std::string is really just another name for std::basic_string<char>.
That said, it is a sequence container and its memory is allocated contiguously. It's possible to get an exception from an std::string if you try to make one bigger than the largest contiguous block you can allocate. This threshold is typically considerably less than the total available memory, due to memory fragmentation.
I've seen problems allocating contiguous memory when trying to allocate, for instance, large contiguous 3D buffers for images. But in my experience these issues don't start happening until the allocations are on the order of 100 MB or so, on Windows XP Pro (for instance).
Are your strings this big?

Related

UART stream packetisation; stream or vector?

I am writing some code to interface an STM32H7 with a BM64 Bluetooth module over UART.
The BM64 expects binary data in bytes; in general:
1. Start word (0xAA)
2-3. Payload length
4. Message ID
5-n. Payload
n+1. Checksum
My question is around best practice for message queuing, namely:
Custom iostream, message vectors inside an interface class or other?
My understanding so far, please correct if wrong or add if something missed:
A custom iostream has the huge benefit of concise usage inline with cout etc. It is very usable and clean, and most likely portable, at least in principle, to other devices on this project operating on other UART ports. The disadvantage is that it is a lot of work to create a custom streambuf, and I'm not sure what to use for "endl" (I can't use null or '\n', as these bytes may legitimately appear in a binary message).
Vectors seem a bit dirty to me and particularly for embedded stuff, the dynamic allocations could be stealing a lot of memory unless I ruthlessly spend cycles on resize() and reserve(). However, a vector of messages (defined as either a class or struct) would be very quick and easy to do.
Is there another solution? Note, I'd prefer not to use arrays, i.e. passing around buffer pointers and buffer lengths.
What would you suggest in this application?
On bare-metal systems I prefer fixed-size buffers sized for the maximum possible payload. Two of them, statically allocated: one to fill and one to send in parallel, switching over when finished. All kinds of dynamic memory allocation end in memory fragmentation, especially if such buffers jitter in size.
Even if your system has an MMU, it is probably a good idea not to do much dynamic heap allocation at all. I have often used a hand-written block-pool memory manager to get rid of long-term fragmentation and late allocation failures.
If you fear using more RAM than is currently needed, think again: if you have so little RAM that you cannot afford the worst case, your system may fail exactly when you really do need the maximum buffer size, and that is never acceptable on embedded systems. That is a good argument for allocating all memory more or less statically, as long as the worst case can actually happen at "some point in the future" :-)

Can STL help addressing memory fragmentation

This is regarding a new TCP server being developed (in C++ on Windows/VC2010)
Thousands of clients connect and keep sending enormous asynchronous requests. I am storing incoming requests in a raw linked list (a 'C'-style linked list of structures, where each structure is one request) and processing them one by one in synchronized threads.
I am using new and delete to create/destroy those request structures.
Until now I was under the impression that this is the most efficient approach. But recently I found that even after all clients were disconnected, the Private Bytes of the server process still showed a lot of memory consumption (around 45 MB). It never came back to its original level.
I dug around a lot and made sure there are no memory leaks. Finally, I came across this and realized it's because of memory fragmentation caused by lots of new and delete calls.
Now my questions are:
If I replace my raw linked list with STL data structures to store incoming requests, will it help me get rid of memory fragmentation? (Because as per my knowledge the STL uses contiguous blocks -- a kind of its own memory management resulting in much less fragmentation. But I am not sure if this is true.)
What would be performance impact in that case as compared to raw linked list?
I suspect your main problem is that you are using linked lists. Linked lists are horrible for this sort of thing and cause exactly the problem you are seeing. Many years ago, I wrote TCP code that did very similar things, in plain old C. The way to deal with this is to use dynamic arrays. You end up with far fewer allocations.
In those bad old days, I rolled my own, which is actually quite simple. Just allocate a single data structure for some number of records, say ten. When you are about to overflow, double the size, reallocating and copying. Because you increase the size exponentially, you will never have more than a handful of allocations, making fragmentation a non-issue. In addition, you have none of the overhead that comes with list.
Really, lists should almost never be used.
Now in terms of your actual question, yes, the STL should help you, but DON'T use std::list. Use std::vector in the manner I just outlined. In my experience, in 95% of cases, std::list is the inferior choice.
If you use std::vector, you may want to use vector::reserve to preallocate the number of records you expect to see. It'll save you a few allocations.
Have you seen that your memory usage and fragmentation is causing you performance problems? I would think it is more from doing new / delete a lot. STL probably won't help unless you use your own allocator and pre-allocate a large chunk and manage it yourself. In other words, it will require a lot of work.
It's often OK to use up memory if you have it. You may want to consider pooling your request structures so you don't need to reallocate them. Then you can allocate on demand and add them to your pool.
Maybe. std::list allocates each node dynamically, just like a homebrew linked list. "STL uses contiguous blocks" -- this is not true in general. You could try std::vector, which is like an array and therefore will cause less memory fragmentation. It depends on what you need the data structure for.
I wouldn't expect any discernible difference in performance between a (well-implemented) homebrew linked list and std::list. If you need a stack, std::vector is much more efficient, and if you need a queue (e.g. a FIFO) then std::deque is much more efficient than a linked list.
If you are serious about preventing memory fragmentation, you will need to manage your own memory and custom allocators or use some third party library. It's not a trivial task.
Instead of raw pointers you can use std::unique_ptr. It has minimal overhead, and makes sure your pointers get deleted.
In my opinion there are pretty few cases where a linked list is the right choice of data structure. You need to choose your data structure based on the way you use your data. For example, using a vector will keep your data together, which is good for the cache, and if you can manage to add/remove elements only at its end, you avoid fragmentation.
If you want to avoid the overhead of new/delete you can pool your objects, though you still need to handle fragmentation within the pool.

Reading from a socket into a buffer

This question might seem simple, but I think it's not so trivial. Or maybe I'm overthinking this, but I'd still like to know.
Let's imagine we have to read data from a TCP socket until we encounter some special character. The data has to be saved somewhere. We don't know the size of the data, so we don't know how large to make our buffer. What are the possible options in this case?
Extend the buffer as more data arrives using realloc. This approach raises a few questions. What are the performance implications of using realloc? It may move memory around, so if there's a lot of data in the buffer (and there can be a lot of data), we'll spend a lot of time moving bytes around. How much should we extend the buffer size? Do we double it every time? If yes, what about all the wasted space? If we call realloc later with a smaller size, will it truncate the unused bytes?
Allocate new buffers in constant-size chunks and chain them together. This would work much like the deque container from the C++ standard library, allowing for quickly appending new data. This also has some questions, like how big should we make the block and what to do with the unused space, but at least it has good performance.
What is your opinion on this? Which of these two approaches is better? Maybe there is some other approach I haven't considered?
P.S.:
Personally, I'm leaning more towards the second solution, because I think it can be made pretty fast if we "recycle" the blocks instead of doing dynamic allocations every time a block is needed. The only problem I can see with it is that it hurts locality, but I don't think that it's terribly important for my purposes (processing HTTP-like requests).
Thanks
I'd prefer the second variant. You may also consider using just one raw buffer and processing the received data before you receive the next batch from the socket, i.e. start processing the data before you encounter the special character.
In any case I would not recommend using raw memory and realloc, but use std::vector which has its own reallocation, or use std::array as a fixed size buffer.
You may also be interested in Boost.Asio's socket_iostreams which provide another abstraction layer above the raw buffer.
Method 2 sounds better, however there may be significant ramifications on your parser... i.e. once you find your special marker, dealing with non-contiguous buffers while parsing for HTTP requests may end up being more costly or complex than reallocing a large buffer (method 1). Net-net: if your parser is trivial, go with 2, if not, go with 1.

Dynamically allocate or waste memory?

I have a 2d integer array used for a tile map.
The size of the map is unknown and read in from a file at runtime. Currently the biggest file is 2500 items (a 50x50 grid).
I have a working method of dynamic memory allocation from an earlier question, but people keep saying that it's a bad idea, so I have been wondering whether to just use a big array and not fill it all up when using a smaller map.
Does anyone know of any pros or cons to either solution? Any advice or personal opinions welcome.
c++ btw
edit: all the maps are made by me so I can pick a max size.
Probably the easiest way is for example a std::vector<std::vector<int> > to allow it to be dynamically sized AND let the library do all the allocations for you. This will prevent accidentally leaking memory.
My preference would be to dynamically allocate. That way should you encounter a surprisingly large map you (hopefully) won't overflow if you've written it correctly, whereas with the fixed size your only option is to return an error and fail.
Presumably loading tile maps is a pretty infrequent operation. I'd be willing to bet too that you can't even measure a meaningful difference in speed between the two. Unless there is a measurable performance reduction, or you're actually hitting something else which is causing you problems the static sized one seems like a premature optimisation and is asking for trouble later on.
It depends entirely on requirements that you haven't stated :-)
If you want your app to be as blazingly fast as possible, with no ability to handle larger tile maps, then by all means just use a big array. For small PIC-based embedded systems this could be an ideal approach.
But, if you want your code to be robust, extensible, maintainable and generally suitable for a wider audience, use STL containers.
Or, if you just want to learn stuff, and have no concern about maintainability or performance, try and write your own dynamically allocating containers from scratch.
I believe the issue people refer to with dynamic allocation results from allocating randomly sized blocks of memory and not being able to effectively manage the randomly sized holes left when they are deallocated. If you're allocating fixed-size tiles then this may not be an issue.
I see quite a few people suggest allocating a large block of memory and managing it themselves. That might be an alternative solution.
Is allocating the memory dynamically a bottleneck in your program? Is it the cause of a performance issue? If not, then simply keep dynamic allocation, you can handle any map size. If yes, then maybe use some data structure that does not deallocate the memory it has allocated but rather use its old buffer and if needed, reallocate more memory.

Optimizing for space instead of speed in C++

When you say "optimization", people tend to think "speed". But what about embedded systems where speed isn't all that critical, but memory is a major constraint? What are some guidelines, techniques, and tricks that can be used for shaving off those extra kilobytes in ROM and RAM? How does one "profile" code to see where the memory bloat is?
P.S. One could argue that "prematurely" optimizing for space in embedded systems isn't all that evil, because you leave yourself more room for data storage and feature creep. It also allows you to cut hardware production costs because your code can run on smaller ROM/RAM.
P.P.S. References to articles and books are welcome too!
P.P.P.S. These questions are closely related: 404615, 1561629
My experience from an extremely constrained embedded memory environment:
Use fixed size buffers. Don't use pointers or dynamic allocation because they have too much overhead.
Use the smallest int data type that works.
Don't ever use recursion. Always use looping.
Don't pass lots of function parameters. Use globals instead. :)
There are many things you can do to reduce your memory footprint; I'm sure people have written books on the subject, but a few of the major ones are:
Compiler options to reduce code size (including -Os and packing/alignment options)
Linker options to strip dead code
If you're loading from flash (or ROM) to ram to execute (rather than executing from flash), then use a compressed flash image, and decompress it with your bootloader.
Use static allocation: a heap is an inefficient way to allocate limited memory, and it might fail due to fragmentation when memory is constrained.
Tools to find the stack high-watermark (typically they fill the stack with a pattern, execute the program, then see where the pattern remains), so you can set the stack size(s) optimally
And of course, optimising the algorithms you use for memory footprint (often at expense of speed)
A few obvious ones
If speed isn't critical, execute the code directly from flash.
Declare constant data tables using const. This will avoid the data being copied from flash to RAM
Pack large data tables tightly using the smallest data types, and in the correct order to avoid padding.
Use compression for large sets of data (as long as the compression code doesn't outweigh the data)
Turn off exception handling and RTTI.
Did anybody mention using -Os? ;-)
Folding knowledge into data
One of the rules of Unix philosophy can help make code more compact:
Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.
I can't count how many times I've seen elaborate branching logic, spanning many pages, that could've been folded into a nice compact table of rules, constants, and function pointers. State machines can often be represented this way (State Pattern). The Command Pattern also applies. It's all about the declarative vs imperative styles of programming.
Log codes + binary data instead of text
Instead of logging plain text, log event codes and binary data. Then use a "phrasebook" to reconstitute the event messages. The messages in the phrasebook can even contain printf-style format specifiers, so that the event data values are displayed neatly within the text.
Minimize the number of threads
Each thread needs its own memory block for a stack and TSS. Where you don't need preemption, consider making your tasks execute co-operatively within the same thread (cooperative multi-tasking).
Use memory pools instead of hoarding
To avoid heap fragmentation, I've often seen separate modules hoard large static memory buffers for their own use, even when the memory is only occasionally required. A memory pool could be used instead so that the memory is only used "on demand". However, this approach may require careful analysis and instrumentation to make sure pools are not depleted at runtime.
Dynamic allocation only at initialization
In embedded systems where only one application runs indefinitely, you can use dynamic allocation in a sensible way that doesn't lead to fragmentation: just dynamically allocate once in your various initialization routines, and never free the memory. reserve() your containers to the correct capacity and don't let them auto-grow. If you need to frequently allocate/free buffers of data (say, for communication packets), then use memory pools. I once even extended the C/C++ runtime so that it would abort my program if anything tried to dynamically allocate memory after the initialization sequence.
As with all optimization, first optimize algorithms, second optimize the code and data, finally optimize the compiler.
I don't know what your program does, so I can't advice on algorithms. Many others have written about the compiler. So, here's some advice on code and data:
Eliminate redundancy in your code. Any repeated code that's three or more lines long, repeated three times in your code, should be changed to a function call.
Eliminate redundancy in your data. Find the most compact representation: merge read-only data, and consider using compression codes.
Run the code through a regular profiler; eliminate all code that isn't used.
Generate a map file from your linker. It will show how the memory is allocated. This is a good start when optimizing for memory usage. It also will show all the functions and how the code-space is laid out.
Here's a book on the subject Small Memory Software: Patterns for systems with limited memory.
Compile in VS with /Os. Often times this is even faster than optimizing for speed anyway, because smaller code size == less paging.
COMDAT folding should be enabled in the linker (it is by default in release builds).
Be careful about data structure packing; often this results in the compiler generating more code (== more memory) for the assembly needed to access unaligned memory. Using 1 bit for a boolean flag is a classic example.
Also, be careful when choosing a memory efficient algorithm over an algorithm with a better runtime. This is where premature optimizations come in.
Ok most were mentioned already, but here is my list anyway:
Learn what your compiler can do. Read compiler documentation, experiment with code examples. Check settings.
Check the generated code at your target optimization level. Sometimes the results are surprising, and often it turns out that optimization actually slows things down (or just takes too much space).
Choose a suitable memory model. If you target a really small, tight system, a large or huge memory model might not be the best choice (but it's usually the easiest to program for...).
Prefer static allocation. Use dynamic allocation only at startup, or over a statically allocated buffer (a pool, or a static buffer sized for the maximum number of instances).
Use C99-style data types. Use the smallest sufficient data type for storage; local variables like loop counters are sometimes more efficient with the "fast" data types.
Select inline candidates. Some parameter-heavy functions with relatively simple bodies are better off inlined. Or consider passing a structure of parameters. Globals are also an option, but be careful -- testing and maintenance can become difficult if anyone involved isn't disciplined enough.
Use the const keyword well; be aware of array initialization implications.
Generate a map file, ideally with module sizes. Check also what gets pulled in from the CRT (is it really necessary?).
Recursion: just say no (stack space is limited).
Floating-point numbers: prefer fixed-point math. Floating point tends to pull in and call a lot of library code (even for a simple addition or multiplication).
C++: you should know C++ VERY WELL. If you don't, program constrained embedded systems in C, please. Those who dare must be careful with all the advanced C++ constructs (inheritance, templates, exceptions, overloading, etc.). Consider writing close-to-hardware code in a "Super-C" style, and use C++ where it counts: in high-level logic, GUI, etc.
Disable whatever you don't need in compiler settings (be it parts of libraries, language constructs, etc.)
Last but not least - while hunting for smallest possible code size - don't overdo it. Watch out also for performance and maintainability. Over-optimized code tends to decay very quickly.
Firstly, tell your compiler to optimize for code size. GCC has the -Os flag for this.
Everything else is at the algorithmic level - use similar tools that you would for finding memory leaks, but instead look for allocs and frees that you could avoid.
Also take a look at commonly used data structure packing - if you can shave a byte or two off them, you can cut down memory use substantially.
If you're looking for a good way to profile your application's heap usage, check out valgrind's massif tool. It will let you take snapshots of your app's memory usage profile over time, and you can then use that information to better see where the "low hanging fruit" is, and aim your optimizations accordingly.
Profiling code or data bloat can be done via map files: for gcc see here, for VS see here.
I have yet to see a useful tool for size profiling though (and don't have time to fix my VS AddIn hack).
On top of what others suggest:
Limit use of C++ features; write as if in ANSI C with minor extensions. Standard (std::) templates use a large system of dynamic allocation. If you can, avoid templates altogether. While not inherently harmful, they make it way too easy to generate lots and lots of machine code from just a couple of simple, clean, elegant high-level instructions. This encourages writing in a way that -- despite all the "clean code" advantages -- is very memory hungry.
If you must use templates, write your own or use ones designed for embedded use, pass fixed sizes as template parameters, and write a test program so you can test your template AND check your -S output to ensure the compiler is not generating horrible assembly code to instantiate it.
Align your structures by hand, or use #pragma pack
{ char a; long b; char c; long d; char e; char f; } // 20 bytes with 4-byte longs, due to padding
{ long b; long d; char a; char c; char e; char f; } // 12 bytes: same members, reordered, no padding
For the same reason, use a centralized global data storage structure instead of scattered local static variables.
Intelligently balance usage of malloc()/new and static structures.
If you need a subset of functionality of given library, consider writing your own.
Unroll short loops.
for (i = 0; i < 3; i++) { transform_vector(i); }
is longer than
transform_vector(0);
transform_vector(1);
transform_vector(2);
Don't do that for longer ones.
Pack multiple files together to let the compiler inline short functions and perform various optimizations Linker can't.
Don't be afraid to write 'little languages' inside your program. Sometimes a table of strings and an interpreter can get a LOT done. For instance, in a system I've worked on, we have a lot of internal tables, which have to be accessed in various ways (loop through, whatever). We've got an internal system of commands for referencing the tables that forms a sort of half-way language that's quite compact for what it gets done.
But, BE CAREFUL! Know that you are writing such things (I wrote one accidentally, myself), and DOCUMENT what you are doing. The original developers do NOT seem to have been conscious of what they were doing, so it's much harder to manage than it should be.
Optimizing is a popular term but often technically incorrect. It literally means to make optimal. Such a condition is never actually achieved for either speed or size. We can simply take measures to move toward optimization.
Many (but not all) of the techniques used to move toward minimum time to a computing result sacrifices memory requirement, and many (but not all) of the techniques used to move toward minimum memory requirement lengthens the time to result.
Reduction of memory requirements amounts to a fixed number of general techniques. It is difficult to find a specific technique that does not neatly fit into one or more of these. If you did all of them, you'd have something very close to the minimal space requirement for the program if not the absolute minimum possible. For a real application, it could take a team of experienced programmers a thousand years to do it.
Remove all redundancy from stored data, including intermediates.
Remove all need for storing data that could be streamed instead.
Allocate only the number of bytes needed, never a single more.
Remove all unused data.
Remove all unused variables.
Free data as soon as it is no longer possibly needed.
Remove all unused algorithms and branches within algorithms.
Find the algorithm that is represented in the minimally sized execution unit.
Remove all unused space between items.
This is a computer science view of the topic, not a developer's one.
For instance, packing a data structure is an effort that combines (3) and (9) above. Compressing data is a way to at least partly achieve (1) above. Reducing overhead of higher level programming constructs is a way to achieve some progress in (7) and (8). Dynamic allocation is an attempt to exploit a multitasking environment to employ (3). Compilation warnings, if turned on, can help with (5). Destructors attempt to assist with (6). Sockets, streams, and pipes can be used to accomplish (2). Simplifying a polynomial is a technique to gain ground in (8).
Understanding the meaning of these nine items, and the various ways to achieve them, is the result of years of learning and of checking the memory maps produced by compilation. Embedded programmers often learn them more quickly because of the limited memory available.
Using the -Os option on a GNU compiler asks the compiler to attempt to find patterns that can be transformed to accomplish these, but -Os is an aggregate flag that turns on a number of optimization features, each of which attempts to perform transformations to accomplish one of the nine tasks above.
Compiler directives can produce results without programmer effort, but automated processes in the compiler rarely correct problems created by lack of awareness in the writers of the code.
Bear in mind the implementation cost of some C++ features, such as virtual function tables and overloaded operators that create temporary objects.
Along with what everyone else said, I'd just like to add: don't use virtual functions, because a vtable must be created for them, which can take up who knows how much space.
Also watch out for exceptions. With gcc, I don't believe each try/catch block grows the code much (beyond two function calls per try/catch), but there is a fixed-size support routine that must be linked in, which could be wasting precious bytes.