There are a variety of questions around this topic already on this site:
Guarantees on C++ std::string heap memory allocation?
std::string allocation policy
My question is different from these in that I am interested in how std::string determines its original capacity and how I can rightsize a string assuming that I know how many bytes exactly I will need. Calling reserve(n) can result in the string allocating more memory and I need it to be 24 bytes (right above the sso threshold, can't fit under). Overallocation would be quite dramatic as I potentially hold millions of these in memory, so if it, e.g., aligns at 32 bytes the 33% overhead really hurts. Naturally I would also like to avoid the potential reallocation from shrink_to_fit.
My understanding is that you can get an exact allocation size by initializing as std::string(rightsized_constant, size) via the const char*, size_t ctor, but of course nothing guarantees this.
Is there a reasonably clean way to get this using std::string?
Solution I used after some thinking, discussion with a coworker and brief experiments with clang++/g++ on Linux.
template<typename T>
class Reader {
public:
void operator()(T key, std::string* append_to);
};
...
Reader<KeyType> a;
Reader<KeyType> b;
std::string s;
std::vector<KeyType> keys = ...;
std::vector<std::string> out;
out.reserve(keys.size());
for (const auto& key : keys) {
a(key, &s);
b(key, &s);
out.emplace_back(s); // relies on assumption that copy will rightsize
s.clear(); // relies on the implicit guarantee that this doesn't release memory
} // s rightsized after first iteration
Still curious if anything actually gives a guarantee.
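One way to see what a particular implementation does is to measure capacity() directly. The sketch below is only a measurement aid: CapacityReport and measure_capacity are hypothetical names, and the only things the standard promises are that capacity() is at least size() and at least n after reserve(n). Anything tighter is implementation behaviour that has to be checked per compiler and standard library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type: capacities observed for a string of n characters.
struct CapacityReport {
    std::size_t from_ctor;  // std::string(n, 'x')
    std::size_t reserved;   // empty string after reserve(n)
    std::size_t copied;     // copy of an over-allocated string holding n characters
};

CapacityReport measure_capacity(std::size_t n) {
    std::string from_ctor(n, 'x');

    std::string reserved;
    reserved.reserve(n);

    // Over-allocate on purpose, then copy. Many implementations size the copy
    // from size() rather than capacity(), but the standard does not require it.
    std::string scratch;
    scratch.reserve(4 * n + 64);
    scratch.assign(n, 'x');
    std::string copied(scratch);

    return {from_ctor.capacity(), reserved.capacity(), copied.capacity()};
}
```

Calling measure_capacity(24) and printing the three fields on each target platform shows whether the copy-based approach above really yields a 24-byte capacity there.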
I have some code using a variable length array (VLA), which compiles fine in gcc and clang, but does not work with MSVC 2015.
class Test {
public:
Test() {
P = 5;
}
void somemethod() {
int array[P];
// do something with the array
}
private:
int P;
};
There seem to be two solutions in the code:
using alloca(), taking the risks of alloca in account by making absolutely sure not to access elements outside of the array.
using a vector member variable (assuming that the overhead between vector and c array is not the limiting factor as long as P is constant after construction of the object)
The vector would be more portable (less #ifdef testing which compiler is used), but I suspect alloca() to be faster.
The vector implementation would look like this:
class Test {
public:
Test() {
P = 5;
init();
}
void init() {
array.resize(P);
}
void somemethod() {
// do something with the array
}
private:
int P;
std::vector<int> array;
};
Another consideration: when I only change P outside of the function, is having an array on the heap which isn't reallocated even faster than having a VLA on the stack?
Maximum P will be about 400.
You could, and probably should, use dynamically allocated heap memory, such as that managed by a std::vector (as answered by Peter). You could use smart pointers, or plain raw pointers (new, malloc, ...) that you must not forget to release (delete, free, ...). Notice that heap allocation is probably faster than you believe (in practice, much less than a microsecond on current laptops most of the time).
Sometimes you can move the allocation out of some inner loop, or grow it only occasionally (so for a realloc-like thing, better use unsigned newsize=5*oldsize/4+10; than unsigned newsize=oldsize+1; i.e. have some geometrical growth). If you can't use vectors, be sure to keep separate allocated size and used lengths (as std::vector does internally).
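To make the difference between the two growth rules concrete, here is a small sketch that only counts reallocations; it does not allocate anything. The function names are made up for illustration.

```cpp
#include <cstddef>

// Growth rule quoted above: roughly 25% extra plus a small constant.
std::size_t grow_geometric(std::size_t old_size) { return 5 * old_size / 4 + 10; }

// Naive rule: one more element each time.
std::size_t grow_by_one(std::size_t old_size) { return old_size + 1; }

// Counts how many reallocations are needed to reach `target` elements,
// starting from an empty buffer, under the given growth rule.
template <typename Policy>
std::size_t count_reallocations(std::size_t target, Policy next_size) {
    std::size_t capacity = 0;
    std::size_t reallocations = 0;
    while (capacity < target) {
        capacity = next_size(capacity);
        ++reallocations;
    }
    return reallocations;
}
```

For a target of 400 elements (the maximum P mentioned in the question), the geometric rule needs 11 reallocations while growing by one needs 400.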
Another strategy would be to special case small sizes vs bigger ones. e.g. for an array less than 30 elements, use the call stack; for bigger ones, use the heap.
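A minimal sketch of that small/large split, assuming a threshold of 30 elements as in the example above. kSmallLimit, fill_and_sum and sum_of_squares are hypothetical names, and the "work" done on the scratch array is just a placeholder.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical threshold, taken from the "less than 30 elements" example.
constexpr std::size_t kSmallLimit = 30;

// Placeholder work: fills scratch[0..n) with squares and returns their sum.
long long fill_and_sum(int* scratch, std::size_t n) {
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        scratch[i] = static_cast<int>(i * i);
        total += scratch[i];
    }
    return total;
}

long long sum_of_squares(std::size_t n) {
    if (n < kSmallLimit) {
        std::array<int, kSmallLimit> on_stack{};  // fixed size, lives in the call frame
        return fill_and_sum(on_stack.data(), n);
    }
    std::vector<int> on_heap(n);  // heap allocation for the larger case
    return fill_and_sum(on_heap.data(), n);
}
```

Both branches hand a plain int* to the same worker, so the choice of storage stays an implementation detail of the caller. Whether the split pays off is something only a benchmark can tell.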
If you insist on allocating on the call stack (using VLAs, a commonly available extension to standard C++11, or alloca), be wise to limit your call frame to a few kilobytes. The total call stack is limited to some implementation-specific size (often about a megabyte, or a few of them, on many laptops). On some OSes you can raise that limit (see also setrlimit(2) on Linux).
Be sure to benchmark before hand-tuning your code. Don't forget to enable compiler optimization (e.g. g++ -O2 -Wall with GCC) before benchmarking. Remember that cache misses are generally much more expensive than heap allocation. Don't forget that developer's time also has some cost (which often is comparable to cumulated hardware costs).
Notice that using static variables or data also has issues (it is not reentrant, not thread-safe, and not async-signal-safe; see signal-safety(7)) and is less readable and less robust.
First of all, you're getting lucky if your code compiles with ANY C++ compiler as is. VLAs are not standard C++. Some compilers support them as an extension.
Using alloca() is also not standard, so is not guaranteed to work reliably (or even at all) when using different compilers.
Using a static vector is inadvisable in many cases. In your case, it gives behaviour that is potentially not equivalent to the original code.
A third option you may wish to consider is
// in definition of class Test
void somemethod()
{
std::vector<int> array(P); // assume preceding #include <vector>
// do something with array
}
A vector is essentially a dynamically allocated array, but will be cleaned up properly in the above when the function returns.
The above is standard C++. Unless you perform rigorous testing and profiling that provides evidence of a performance concern this should be sufficient.
Why don't you make the array a private member?
#include <vector>
class Test
{
public:
Test()
{
data_.resize(5);
}
void somemethod()
{
// do something with data_
}
private:
std::vector<int> data_;
};
As you've specified a likely maximum size of the array, you could also look at something like boost::small_vector, which could be used like:
#include <boost/container/small_vector.hpp>
class Test
{
public:
Test()
{
data_.resize(5);
}
void somemethod()
{
// do something with data_
}
private:
// A class-scope constant must be static, and a namespace alias is not allowed here
static constexpr std::size_t preset_capacity_ = 400;
boost::container::small_vector<int, preset_capacity_> data_;
};
You should profile to see if this is actually better, and be aware this will likely use more memory, which could be an issue if there are many Test instances.
For the purpose of PA-DSS I need to be sure that boost::asio::const_buffer (e.g. in boost::asio::async_write) will be zeroed when coming out of scope.
With STL containers I can substitute an allocator/deallocator like this:
void deallocate(volatile pointer p, size_type n) {
std::memset(p, 0, n * sizeof(T));
::operator delete(p);
}
However, I have no idea how to achieve the same with boost::asio::const_buffer, at least not in a way which would still let boost::asio::async_write consume it. Also, I don't want to reinvent the wheel (if there is one).
Short answer: Asio buffers don't own their memory, so they should not be responsible for disposing of it either.
First off, you should not use
std::memset(p, 0, n * sizeof(T));
Use a function like SecureZeroMemory instead: How-to ensure that compiler optimizations don't introduce a security risk?
I realize you had volatile there for this reason, but it might not always be honoured like you expect:
Your secure_memset function might not be sufficient. According to http://open-std.org/jtc1/sc22/wg14/www/docs/n1381.pdf there are optimizing compilers that will only zero the first byte – Daniel Trebbien Nov 9 '12 at 12:50
Background reading:
https://cryptocoding.net/index.php/Coding_rules#Clean_memory_of_secret_data
http://blog.quarkslab.com/a-glance-at-compiler-internals-keep-my-memset.html
On to ASIO
Make sure you fully realize that Boost Asio buffers have no ownership semantics. They only ever reference data owned by another object.
More importantly than the question posed, you might want to check that you keep around the buffer data long enough. A common pitfall is to pass a local as a buffer:
std::string response = "OK\r\n\r\n";
asio::async_write(sock_, asio::buffer(response), ...); // OOOPS!!!
This leads to Undefined Behaviour as soon as response goes out of scope before the asynchronous write has completed.
IOW const_buffer is a concept. There are a gazillion ways to construct it on top of (your own) objects:
From the documentation:
A buffer object represents a contiguous region of memory as a 2-tuple consisting of a pointer and size in bytes. A tuple of the form {void*, size_t} specifies a mutable (modifiable) region of memory. Similarly, a tuple of the form {const void*, size_t} specifies a const (non-modifiable) region of memory. These two forms correspond to the classes mutable_buffer and const_buffer, respectively
So, let's assume you have your buffer type
struct SecureBuffer
{
~SecureBuffer() { shred(); }
size_t size() const { return length_; }
char const* data() const { return data_.data(); }
// ...
private:
void shred(); // uses SecureZeroMemory etc.
std::array<char, 1024> data_ = {0};
size_t length_ = 0u;
};
Then you can simply pass it where you want to use it:
SecureBuffer secret; // member variable (lifetime exceeds async operation)
// ... set data
boost::asio::async_write(sock_,
boost::asio::buffer(secret.data(), secret.size()),
/*...*/
);
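For completeness, here is one way the elided parts of SecureBuffer could be filled in. This is a sketch under assumptions, not a vetted implementation: assign() is a hypothetical member added for illustration, shred() is made public so it can be called directly, and the volatile-pointer loop is only a portable stand-in for a platform facility such as SecureZeroMemory, which should be preferred where available. Asio itself is left out so the block stands alone.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Sketch only: a fixed-capacity buffer that wipes itself on destruction.
class SecureBuffer {
public:
    ~SecureBuffer() { shred(); }

    // Hypothetical setter: copies at most capacity bytes and
    // returns the number of bytes actually stored.
    std::size_t assign(const char* src, std::size_t n) {
        shred();
        length_ = std::min(n, data_.size());
        std::copy(src, src + length_, data_.begin());
        return length_;
    }

    std::size_t size() const { return length_; }
    const char* data() const { return data_.data(); }

    // Writes through a volatile pointer so the stores are not trivially
    // elided. Assumption: this is good enough for illustration only.
    void shred() {
        volatile char* p = data_.data();
        for (std::size_t i = 0; i < data_.size(); ++i) p[i] = 0;
        length_ = 0;
    }

private:
    std::array<char, 1024> data_{};
    std::size_t length_ = 0;
};
```

The object still has to outlive the asynchronous operation, exactly as in the member-variable example above; the destructor only takes care of wiping the bytes once the owner goes away.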
In a C++ question about optimization and code style, several answers referred to "SSO" in the context of optimizing copies of std::string. What does SSO mean in that context?
Clearly not "single sign on". "Shared string optimization", perhaps?
Background / Overview
Operations on automatic variables ("from the stack", which are variables that you create without calling malloc / new) are generally much faster than those involving the free store ("the heap", which are variables that are created using new). However, the size of automatic arrays is fixed at compile time, but the size of arrays from the free store is not. Moreover, the stack size is limited (typically a few MiB), whereas the free store is only limited by your system's memory.
SSO is the Short / Small String Optimization. A std::string typically stores the string as a pointer to the free store ("the heap"), which gives similar performance characteristics as if you were to call new char [size]. This prevents a stack overflow for very large strings, but it can be slower, especially with copy operations. As an optimization, many implementations of std::string create a small automatic array, something like char [20]. If you have a string that is 20 characters or smaller (given this example, the actual size varies), it stores it directly in that array. This avoids the need to call new at all, which speeds things up a bit.
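A common way to observe this from the outside is to check whether data() points into the string object itself. The helper below is a sketch with a made-up name (stored_inline); it relies on converting pointers to std::uintptr_t, which is widely available but technically an optional type, and the result is implementation-specific by nature.

```cpp
#include <cstdint>
#include <string>

// Returns true if the character data lives inside the std::string object itself.
// Takes a reference so that the caller's object, not a copy, is inspected.
bool stored_inline(const std::string& s) {
    const auto object_begin = reinterpret_cast<std::uintptr_t>(&s);
    const auto object_end = object_begin + sizeof(std::string);
    const auto data_begin = reinterpret_cast<std::uintptr_t>(s.data());
    return data_begin >= object_begin && data_begin < object_end;
}
```

On an implementation with SSO, a short string typically reports true and a long one false; a string much larger than sizeof(std::string) can never be stored inline.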
EDIT:
I wasn't expecting this answer to be quite so popular, but since it is, let me give a more realistic implementation, with the caveat that I've never actually read any implementation of SSO "in the wild".
Implementation details
At the minimum, a std::string needs to store the following information:
The size
The capacity
The location of the data
The size could be stored as a std::string::size_type or as a pointer to the end. The only difference is whether you want to have to subtract two pointers when the user calls size or add a size_type to a pointer when the user calls end. The capacity can be stored either way as well.
You don't pay for what you don't use.
First, consider the naive implementation based on what I outlined above:
class string {
public:
// all 83 member functions
private:
std::unique_ptr<char[]> m_data;
size_type m_size;
size_type m_capacity;
std::array<char, 16> m_sso;
};
For a 64-bit system, that generally means that std::string has 24 bytes of 'overhead' per string, plus another 16 for the SSO buffer (16 chosen here instead of 20 due to padding requirements). It wouldn't really make sense to store those three data members plus a local array of characters, as in my simplified example. If m_size <= 16, then I will put all of the data in m_sso, so I already know the capacity and I don't need the pointer to the data. If m_size > 16, then I don't need m_sso. There is absolutely no overlap where I need all of them. A smarter solution that wastes no space would look something a little more like this (untested, example purposes only):
class string {
public:
// all 83 member functions
private:
size_type m_size;
union {
class {
// This is probably better designed as an array-like class
std::unique_ptr<char[]> m_data;
size_type m_capacity;
} m_large;
std::array<char, sizeof(m_large)> m_small;
};
};
I'd assume that most implementations look more like this.
SSO is the abbreviation for "Small String Optimization", a technique where small strings are embedded in the body of the string class rather than using a separately allocated buffer.
As already explained by the other answers, SSO means Small / Short String Optimization.
The motivation behind this optimization is the observation that applications in general handle many more short strings than long ones.
As explained by David Stone in his answer above, the std::string class uses an internal buffer to store contents up to a given length, and this eliminates the need to dynamically allocate memory. This makes the code more efficient and faster.
This other related answer clearly shows that the size of the internal buffer depends on the std::string implementation, which varies from platform to platform (see benchmark results below).
Benchmarks
Here is a small program that benchmarks the copy operation of lots of strings with the same length.
It starts printing the time to copy 10 million strings with length = 1.
Then it repeats with strings of length = 2. It keeps going until the length is 50.
#include <string>
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib> // rand
static const char CHARS[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
static const int ARRAY_SIZE = sizeof(CHARS) - 1;
static const int BENCHMARK_SIZE = 10000000;
static const int MAX_STRING_LENGTH = 50;
using time_point = std::chrono::high_resolution_clock::time_point;
void benchmark(std::vector<std::string>& list) {
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
// force a copy of each string in the loop iteration
for (const auto s : list) {
std::cout << s;
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
const auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
std::cerr << list[0].length() << ',' << duration << '\n';
}
void addRandomString(std::vector<std::string>& list, const int length) {
std::string s(length, 0);
for (int i = 0; i < length; ++i) {
s[i] = CHARS[rand() % ARRAY_SIZE];
}
list.push_back(s);
}
int main() {
std::cerr << "length,time\n";
for (int length = 1; length <= MAX_STRING_LENGTH; length++) {
std::vector<std::string> list;
for (int i = 0; i < BENCHMARK_SIZE; i++) {
addRandomString(list, length);
}
benchmark(list);
}
return 0;
}
If you want to run this program, you should do it like ./a.out > /dev/null so that the time to print the strings isn't counted.
The numbers that matter are printed to stderr, so they will show up in the console.
I have created charts with the output from my MacBook and Ubuntu machines.
Note that there is a huge jump in the time to copy the strings when the length reaches a given point.
That's the moment when strings don't fit in the internal buffer anymore and memory allocation has to be used.
Note also that on the Linux machine, the jump happens when the length of the string reaches 16.
On the MacBook, the jump happens when the length reaches 23. This confirms that SSO depends on the platform implementation.
[Chart: copy time vs. string length, Ubuntu]
[Chart: copy time vs. string length, MacBook Pro]
I have a function foo(int[] nums) which I understand is essentially equivalent to foo(int* nums). Inside foo I need to copy the contents of the array pointed to by nums into some int[10] declared within the scope of foo. I understand the following is invalid:
void foo (int[] nums)
{
myGlobalArray = *nums
}
What is the proper way to copy the array? Should I use memcpy like so:
void foo (int[] nums)
{
memcpy(&myGlobalArray, nums, 10);
}
or should I use a for loop?
void foo(int[] nums)
{
for(int i =0; i < 10; i++)
{
myGlobalArray[i] = nums[i];
}
}
Is there a third option that I'm missing?
Yes, the third option is to use a C++ construct:
std::copy(&nums[0], &nums[10], myGlobalArray);
With any sane compiler, it:
should be optimum in the majority of cases (will compile to memcpy() where possible),
is type-safe,
gracefully copes when you decide to change the data-type to a non-primitive (i.e. it calls copy constructors, etc.),
gracefully copes when you decide to change to a container class.
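As a sketch of how the std::copy version might look in full, the variant below takes the array by reference so its length stays part of the type. The signature is an assumption for illustration; the question's foo takes a bare pointer, in which case the length has to be passed or assumed separately.

```cpp
#include <algorithm>
#include <iterator>

int myGlobalArray[10];

// Taking the array by reference keeps its length in the type, so a caller
// cannot pass an array of a different size by accident.
void foo(const int (&nums)[10]) {
    std::copy(std::begin(nums), std::end(nums), std::begin(myGlobalArray));
}
```

With this signature, passing an int[5] or a plain int* fails to compile instead of silently reading past the end.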
Memcpy will probably be faster, but it's more likely you will make a mistake using it.
It may depend on how smart your optimizing compiler is.
Your code is incorrect though. It should be:
memcpy(myGlobalArray, nums, 10 * sizeof(int) );
Generally speaking, the worst case scenario will be in an un-optimized debug build where memcpy is not inlined and may perform additional sanity/assert checks amounting to a small number of additional instructions vs a for loop.
However memcpy is generally well implemented to leverage things like intrinsics etc, but this will vary with target architecture and compiler. It is unlikely that memcpy will ever be worse than a for-loop implementation.
People often trip over the fact that memcpy sizes in bytes, and they write things like these:
// wrong unless we're copying bytes.
memcpy(myGlobalArray, nums, numNums);
// wrong if an int isn't 4 bytes or the type of nums changed.
memcpy(myGlobalArray, nums, numNums * 4);
// wrong if nums is no-longer an int array.
memcpy(myGlobalArray, nums, numNums * sizeof(int));
You can protect yourself here by using language features that let you do some degree of reflection, that is: do things in terms of the data itself rather than what you know about the data, because in a generic function you generally don't know anything about the data:
void foo (int* nums, size_t numNums)
{
memcpy(myGlobalArray, nums, numNums * sizeof(*nums));
}
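Note that you don't need the "&" in front of "myGlobalArray" because arrays automatically decay to pointers. For a true array, &myGlobalArray yields the same address (with a different type), so the call happens to work; but if myGlobalArray were a pointer, you would be copying "nums" over the pointer itself rather than into the memory it points to.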
(Edit note: I'd typo'd int[] nums when I meant int nums[] but I decided that adding C array-pointer-equivalence chaos helped nobody, so now it's int *nums :))
Using memcpy on objects can be dangerous, consider:
struct Foo {
std::string m_string;
std::vector<int> m_vec;
};
Foo f1;
Foo f2;
f2.m_string = "hello";
f2.m_vec.push_back(42);
memcpy(&f1, &f2, sizeof(f2));
This is the WRONG way to copy objects that aren't POD (plain old data). Both f1 and f2 now have a std::string that thinks it owns "hello". One of them is going to crash when they destruct, and they both think they own the same vector of integers that contains 42.
The best practice for C++ programmers is to use std::copy:
std::copy(nums, nums + numNums, myGlobalArray);
Or, since C++11 (as noted by Remy Lebeau):
std::copy_n(nums, numNums, myGlobalArray);
This can make compile time decisions about what to do, including using memcpy or memmove and potentially using SSE/vector instructions if possible. Another advantage is that if you write this:
struct Foo {
int m_i;
};
Foo f1[10], f2[10];
memcpy(&f1, &f2, sizeof(f1));
and later on change Foo to include a std::string, your code will break. If you instead write:
struct Foo {
int m_i;
};
enum { NumFoos = 10 };
Foo f1[NumFoos], f2[NumFoos];
std::copy(f2, f2 + NumFoos, f1);
the compiler will switch your code to do the right thing without any additional work for you, and your code is a little more readable.
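To illustrate the point, here is a sketch of what the code looks like after Foo has gained a std::string member. The member name m_name and the wrapper copy_foos are made up for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

struct Foo {
    int m_i = 0;
    std::string m_name;  // added later: Foo is no longer trivially copyable
};

constexpr std::size_t NumFoos = 10;

// Element-wise copy: invokes Foo's copy assignment for every element,
// so each destination string ends up owning its own characters.
void copy_foos(const Foo (&from)[NumFoos], Foo (&to)[NumFoos]) {
    std::copy(from, from + NumFoos, to);
}
```

The call site is unchanged from the int-only version; a memcpy in the same place would have duplicated the strings' internal pointers instead.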
For performance, use memcpy (or equivalents). It's highly optimised platform-specific code for shunting lots of data around fast.
For maintainability, consider what you're doing - the for loop may be more readable and easier to understand. (Getting a memcpy wrong is a fast route to a crash or worse)
Essentially, as long as you are dealing with POD types (Plain Ol' Data), such as int, unsigned int, pointers, data-only structs, etc... you are safe to use mem*.
If your array contains objects, use the for loop, as the = operator may be required to ensure proper assignment.
A simple loop is slightly faster for about 10-20 bytes and less (It's a single compare+branch, see OP_T_THRES), but for larger sizes, memcpy is faster and portable.
Additionally, if the amount of memory you want to copy is constant, you can use memcpy to let the compiler decide what method to use.
Side note: the optimizations that memcpy uses may significantly slow your program down in a multithreaded environment when you're copying a lot of data above the OP_T_THRES size mark since the instructions this invokes are not atomic and the speculative execution and caching behavior for such instructions doesn't behave nicely when multiple threads are accessing the same memory. Easiest solution is to not share memory between threads and only merge the memory at the end. This is good multi-threading practice anyway.
My question is with regard to C++
Suppose I write a function to return a list of items to the caller. Each item has 2 logical fields: 1) an int ID, and 2) some data whose size may vary, let's say from 4 bytes up to 16Kbytes. So my question is whether to use a data structure like:
struct item {
int field1;
char field2[MAX_LEN];
};
OR, rather, to allocate field2 from the heap, and require the caller to destroy when he's done:
struct item {
int field1;
char *field2; // new char[N] -- delete[] when done!
};
Since the max size of field #2 is large, it makes sense that this would be allocated from the heap, right? So once I know the size N, I call field2 = new char[N], and populate it.
Now, is this horribly inefficient?
Is it worse in cases where N is always small, i.e. suppose I have 10000 items that have N=4?
You should instead use one of the standard library containers, like std::string or std::vector<char>; then you don't have to worry about managing the memory yourself.
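A minimal sketch of that suggestion, assuming the payload is treated as raw bytes held in a std::string. make_items is a hypothetical producer used only to show the shape of the interface.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct item {
    int field1;
    std::string field2;  // owns its bytes; may contain embedded '\0'
};

// Hypothetical producer: builds `count` items whose payload has `payload_size` bytes.
std::vector<item> make_items(std::size_t count, std::size_t payload_size) {
    std::vector<item> items;
    items.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        items.push_back(item{static_cast<int>(i), std::string(payload_size, '\0')});
    }
    return items;
}
```

The caller receives the vector by value and has nothing to destroy by hand. On implementations with a small string optimization, 4-byte payloads will usually not touch the heap at all.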
The thing that's horribly inefficient is all the time you will waste tracking down memory leaks. Use classes that take care of this for you.
But if you don't want to do that:
suppose I have 10000 items that have N=4?
So you waste 40k of memory - your PC has at least a gigabyte, probably two, don't worry about it. A consistent interface, even if you're doing new/delete, is better than something fancy that will be harder to debug.
The only time you can safely use fixed-size buffers in production code is when the sizes are compile-time system constants, such as MAX_PATH.
You could do both:
struct item {
...
char *field2; // Points to buf if < 8 chars (assuming null-terminator).
char buf[8];
};
This does require some clever copy semantics, so you'll need a custom copy-constructor and assignment operator.
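A sketch of what those copy semantics might look like, assuming the payload is a null-terminated string. The accessors and the private helpers store() and release() are hypothetical names, not part of the original struct.

```cpp
#include <cstddef>
#include <cstring>

class item {
public:
    item(int id, const char* text) : field1(id) { store(text); }
    item(const item& other) : field1(other.field1) { store(other.field2); }
    item& operator=(const item& other) {
        if (this != &other) {
            release();               // frees heap storage, if any
            field1 = other.field1;
            store(other.field2);     // re-decides between buf and the heap
        }
        return *this;
    }
    ~item() { release(); }

    int id() const { return field1; }
    const char* text() const { return field2; }
    bool is_small() const { return field2 == buf; }

private:
    // Copies text (including its terminator) into buf if it fits, else onto the heap.
    void store(const char* text) {
        const std::size_t len = std::strlen(text) + 1;
        char* target = len <= sizeof(buf) ? buf : new char[len];
        std::memcpy(target, text, len);
        field2 = target;
    }
    // Leaves field2 pointing at buf so the object stays valid if a later new throws.
    void release() {
        if (field2 != buf) delete[] field2;
        field2 = buf;
    }

    int field1;
    char* field2 = buf;  // points to buf for short payloads
    char buf[8];
};
```

The key detail is that a copy must never inherit the source's pointer: a short payload has to point at the copy's own buf, and a long one needs its own allocation.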
Alternatively, if item is always heap-allocated, you could ensure that item and its data are always allocated together:
struct item {
...
char field2[1];
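};
item* new_item(std::size_t size) {
// offsetof (from <cstddef>) replaces the null-pointer arithmetic, which is undefined behaviour
std::size_t offset = offsetof(item, field2);
// placement new needs <new>, std::malloc needs <cstdlib>; release with p->~item() then std::free(p)
return new(std::malloc(offset + size)) item;
}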
Actually it depends. As I see it:
statically sized buffer
Good
No need to manage memory
Very efficient in terms of execution speed
Bad
Might waste some memory
dynamically sized buffer
Good
Does not have to waste any memory, as the exact amount needed is known
Bad
Memory must be managed.
Might be slow(er)
With that in mind, and based on the situation (Is it likely sizes will vary much? Is execution speed extra important? ... ), pick one.