Construct std::string from up to X characters, stopping at null char - c++

I am reading strings from a structure in a file where each string has a fixed length, with '\0' padding. They are not zero-terminated if the stored string needs the whole length.
I'm currently constructing std::strings out of those like this:
// char MyString[1000];
std::string stdmystring(MyString, ARRAYSIZE(MyString));
However, this copies the padding, too. I could trim the string now, but is there an elegant and quick way to prevent the copying in the first place?
Speed is more important than space, because this runs in a loop.

Simple solutions are:
Just calculate the correct length first
either use strnlen as Dieter suggested
or std::find(MyString,MyString+ARRAYSIZE(MyString),'\0') which IME isn't any slower
note that if your string fits in cache, that will likely dominate the extra loop cost
reserve the max string size (you did say space was less important), and write a loop appending characters until you exhaust the width or hit a nul (like copy_until)
actually create a max-size string initialized with nuls, strncpy into it, and optionally erase unused nuls if you want the size to be correct
The second option uses only a single loop, while the third notionally uses two (one in the string ctor, and then one in the copy). However, the push_back of each character seems more expensive than the simple character assignment, so I wouldn't be surprised if #3 were faster in reality. Profile and see!
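For reference, a minimal sketch of the first option, reusing the MyString and ARRAYSIZE names from the question (note that strnlen is POSIX/Windows rather than standard C++; the std::find variant works everywhere):

#include <algorithm>   // std::find (for the portable variant)
#include <cstring>     // strnlen (POSIX / Windows)
#include <string>

// char MyString[1000];
const std::size_t len = strnlen(MyString, ARRAYSIZE(MyString));   // stops at '\0' or at the full width
// portable alternative:
// const std::size_t len = std::find(MyString, MyString + ARRAYSIZE(MyString), '\0') - MyString;
std::string stdmystring(MyString, len);   // copies exactly len characters, no padding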

Well, if space is not a problem, one potential way to do it is to create an empty std::string, use reserve() to pre-allocate the space potentially needed, and then append each char until you come across '\0'.
std::string stdmystring;
stdmystring.reserve(MyString_MAX_SIZE);
for (size_t i = 0; i < MyString_MAX_SIZE && MyString[i] != '\0'; ++i)
    stdmystring += MyString[i];
reserve() guarantees you at most one memory allocation, since you know the max size and the string will never grow larger than that.
The calls to operator+= will probably be inlined, but each call still has to check that the string has the needed capacity, which is wasteful in your case. In fact, this could be the same as or worse than simply using strlen to find the exact length of the string first, so you have to test it.

I think the most straightforward way is to overallocate your internal MyString array by one byte, always null terminate that final byte, and use the C-string constructor of std::string. (Keep in mind that most likely your process will be I/O bound on the file so whatever algorithm the C-string constructor uses should be fine).
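A sketch of that idea, reusing the MyString name from the question (the exact field width and the way the record is read from the file are placeholders):

// char MyString[1000 + 1];                 // one spare byte beyond the fixed field width
MyString[sizeof(MyString) - 1] = '\0';      // keep the final byte permanently '\0'
std::string stdmystring(MyString);          // the C-string ctor stops at the first '\0'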

Related

Are two heap allocations more expensive than a call to std::string fill ctor?

I want to have a string with a capacity of 131 chars (or bytes). I know two simple ways of achieving that. So which of these two code blocks is faster and more efficient?
std::string tempMsg( 131, '\0' ); // constructs the string with a 131 byte buffer from the start
tempMsg.clear( ); // clears those '\0' chars to free space for the actual data
tempMsg += "/* some string literals that are appended */";
or this one:
std::string tempMsg; // default constructs the string with a 16 byte buffer
tempMsg.reserve( 131 ); // reallocates the string to increase the buffer size to 131 bytes??
tempMsg += "/* some string literals that are appended */";
I guess the first approach only uses 1 allocation and then sets all those 131 bytes to 0 ('\0') and then clears the string (std::string::clear is generally constant according to: https://www.cplusplus.com/reference/string/string/clear/).
The second approach uses 2 allocations but on the other hand, it doesn't have to set anything to '\0'. But I've also heard about compilers allocating 16 bytes on the stack for a string object for optimization purposes. So the 2nd method might use only 1 heap allocation as well.
So is the first method faster than the other one? Or are there any other better methods?
The most accurate answer is that it depends. The most probable answer is that the second is as fast or faster. Calling the fill ctor requires not only a heap allocation but also a fill (which typically translates to a memset in my experience).
clear usually won't do anything with a POD char besides setting a first pointer or size integer to zero because char is a trivially-destructible type. There's no loop involved with clear usually unless you create std::basic_string with a non-trivial UDT. It's constant-time otherwise and dirt-cheap in practically every standard library implementation.
Edit: An Important Note:
I have never encountered a standard lib implementation that does this, or it has slipped my memory (very possible, as I think I'm turning senile), but Viktor Sehl pointed out something very important in the comments that I was ignorant of:
Please note that std::string::clear() on some implementations free the allocated memory (if there are any), unlike a std::vector. –
That would actually make your first version involve two heap allocations. But the second should still only be one (opposite of what you thought).
Resuming:
But I've also heard about compilers allocating 16 bytes on the stack for a string object for optimization purposes. So the 2nd method might use only 1 heap allocation as well.
Small Buffer Optimizations
The first allocation is a small-buffer stack optimization for implementations that use it (technically not always stack, but it'll avoid additional heap allocations). It's not separately heap-allocated and you can't avoid it with a fill ctor (the fill ctor will still allocate the small buffer). What you can avoid is filling the entire array with '\0' before you fill it with what you actually want, and that's why the second version is likely faster (marginally or not depending on how many times you invoke it from a loop). That's needless overhead unless the optimizer eliminates it for you, and it's unlikely in my experience that optimizers will do that in loopy cases that can't be optimized with something like SSA.
I just pitched in here because your second version is also clearer in intent than filling a string with something as an attempted optimization (in this case a very possibly misguided one, if you ask me) only to throw it out and replace it with what you actually want. The second version is almost certainly as fast or faster in most implementations as well.
On Profiling
I would always suggest measuring if in doubt, especially before you start attempting funny things like in your first example. I can't recommend the profiler enough if you're working in performance-critical fields. The profiler will not only answer this question for you, it'll also teach you to refrain from writing such counter-intuitive code like in the first example, except in places where it makes a real positive difference (in this case I think the difference is actually negative or neutral). From my perspective, the use of both profiler and debugger should ideally be taught in CS 101. The profiler helps mitigate the dangerous tendency for people to optimize the wrong things very counter-productively. Profilers tend to be very easy to use: you just run them while your code performs the expensive operation you want to optimize, and you get back a clear breakdown of where the time goes.
If the small buffer optimization confuses you a bit, a simple illustration is like this:
struct SomeString
{
    // Pre-allocates (always) some memory in advance to avoid additional
    // heap allocs.
    char small_buffer[some_small_fixed_size] = {};

    // Will point to small buffer until string gets large.
    char* ptr = small_buffer;
};
The allocation of the small buffer is unavoidable, but it doesn't require separate calls to malloc/new/new[]. And it's not allocated separately on the heap from the string object itself (if it is allocated on heap). So both of the examples that you showed involve, at most, a single heap allocation (unless your standard library implementation is FUBAR -- edit: or one that Viktor is using). What the first example has conceptually on top of that is a fill/loop (could be implemented as a very efficient intrinsic in assembly but loopy/linear time stuff nevertheless) unless the optimizer eliminates it.
String Optimization
So is the first method faster than the other one? Or are there any other better methods?
You can write your own string type which uses an SBO with, say, 256 bytes for the small buffer which is typically going to be much larger than any std::string optimization. Then you can avoid heap allocations entirely for your 131-length case.
template <class Char, size_t SboSize = 256>
class TempString
{
private:
    // Stores the small buffer.
    Char sbo[SboSize] = {};

    // Points to the small buffer until num > SboSize.
    Char* ptr = sbo;

    // Stores the length of the string.
    size_t num = 0;

    // Stores the capacity of the string.
    size_t cap = SboSize;

public:
    // Destroys the string.
    ~TempString()
    {
        if (ptr != sbo)
            delete[] ptr;
    }

    // Remaining implementation left to reader. Note that implementing
    // swap requires swapping the contents of the SBO if the strings
    // point to them rather than swapping pointers (swapping is a
    // little bit tricky with SBOs involved, so be wary of that).
};
That would be ill-suited for persistent storage, though, because it would blow up memory use (ex: requiring 256+ bytes just to store a string with one character in it) if you stored a bunch of strings persistently in a container. It is well-suited for temporary strings that you transfer into and out of function calls. I'm primarily a gamedev, so rolling our own alternatives to the standard C++ library is quite normal here given our requirements for real-time feedback with high graphical fidelity. I wouldn't recommend it for the faint-hearted though, and definitely not without a profiler. This is a very practical and viable option in my field, although it might be ridiculous in yours. The standard lib is excellent, but it's tailored for the needs of the entire world. You can usually beat it if you tailor your code very specifically to your needs and produce more narrowly-applicable code.
Actually, even std::string with an SBO is rather ill-suited for persistent storage, not just the TempString above. If you store something like std::unordered_map<std::string, T> and std::string uses a 16-byte SBO that inflates sizeof(std::string) to 32 bytes or more, then each key requires 32 bytes even if it stores only one character, fitting two or fewer strings in a single cache line when traversing the hash table. That's a downside to using SBOs: they can blow up your memory use for persistent storage that's part of your application state. But they're excellent for temporaries whose memory is just pushed and popped to/from the stack in a LIFO alloc/dealloc pattern, which only requires incrementing and decrementing a stack pointer.
If you want to optimize the storage of many strings from a memory standpoint, then it depends a lot on your access patterns and needs. However, if you just want to build a dictionary and don't need to erase specific strings dynamically, a fairly simple solution is like so:
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Just using a struct for simplicity of illustration:
struct MyStrings
{
    // Stores all the characters for all the null-terminated strings.
    std::vector<char> buffer;

    // Stores the starting index into the buffer for the nth string.
    std::vector<std::size_t> string_start;

    // Inserts a null-terminated string into the buffer.
    void insert(const std::string_view str)
    {
        string_start.push_back(buffer.size());
        buffer.insert(buffer.end(), str.begin(), str.end());
        buffer.push_back('\0');
    }

    // Returns the nth null-terminated string.
    std::string_view operator[](std::int32_t n) const
    {
        return {buffer.data() + string_start[n]};
    }
};
Another common solution that can be very useful if you store a lot of duplicate strings in an associative container or need fast searches for strings that can be looked up in advance is to use string interning. The above solution can also be combined to implement an efficient way to store all the interned strings. Then you can store lightweight indices or pointers to your interned strings and compare them immediately for equality, e.g., without involving any loops, and store many duplicate references to strings that only cost the size of an integer or pointer.
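As a rough sketch of what interning might look like (the StringInterner class, its member names, and the choice of a 32-bit ID are illustrative, not something from a particular library):

#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Maps each distinct string to a stable integer ID. Interning a duplicate
// returns the existing ID, so equality tests become single integer compares.
class StringInterner
{
public:
    std::uint32_t intern(std::string_view str)
    {
        std::string key(str);
        auto it = ids.find(key);
        if (it != ids.end())
            return it->second;
        const std::uint32_t id = static_cast<std::uint32_t>(strings.size());
        strings.push_back(key);
        ids.emplace(std::move(key), id);
        return id;
    }

    // Returns the string for a previously interned ID.
    const std::string& lookup(std::uint32_t id) const { return strings[id]; }

private:
    std::vector<std::string> strings;                    // id -> string
    std::unordered_map<std::string, std::uint32_t> ids;  // string -> id
};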

Space complexity of str.substr() in C++

What is the space complexity of the str.substr() function and how does it compare to str.erase()?
Wondering because I was running a code on leetcode, and used 150MB of memory when I used the substr function:
num = num.substr(1,num.size());
As soon as I removed this function and instead used the erase function, while changing nothing else in my code, the memory usage fell to 6.8MB. Updated code with erase function:
num = num.erase(0,1);
num = num.substr(1,num.size());
substr creates a copy of the string without the first character, so even though the result is one character shorter, right after the call you have (almost) two copies of the initial string in memory.
(1) If the strings were shared, the initial string would be deallocated after the assignment provided it is not referenced from elsewhere, but before the assignment both versions would still be in memory.
num = num.erase(0,1);
modifies the string in place, so only one version of the string is needed during execution
note it is the same as doing
num.erase(0,1);
(1): per Pete Becker's remark, since C++11 the internal representation of std::basic_string is explicitly not allowed to be shared (i.e. no copy-on-write)
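Side by side, using the num variable from the question:

num = num.substr(1);   // allocates a new buffer and copies size()-1 chars; briefly two buffers exist
num.erase(0, 1);       // shifts the remaining chars left within num's existing buffer; no new allocation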
Space complexity of str.substr() in C++
Technically, it depends on the type of str.
Reasonably, there should be no overhead on top of the size of the output. The space complexity of std::string is linear in relation to the length of the string.

Should I always declare char array bigger than my string is?

So I know that you should declare char arrays to be one element bigger than the word you want to put there because of the \0 that has to be at the end, but what about char arrays that I don't want to use as words?
I'm currently writing a program in which I store an array of keyboard letters that have some function assigned to them. Should I still end this array with \0?
That is probably not necessary.
A null terminator is not a requirement for arrays of char; it is a requirement for "C-strings", things that you intend to use as unitary blobs of data, particularly if you intend to pass them to C API functions. It's the conventional way that the "length" of the string is determined.
But if you just want a collection of chars to use independently then knock yourself out.
We cannot see your code, but it sounds to me like you don't want or need it in this case.
The array should have, at least, the same number of elements as the data you will put there. So, if:
you don't need the '\0'
you won't place it there
you won't use routines that will depend on an '\0' to tell you the array size
... you are good with not using the trailing '\0'
If you're using C++, you should probably just use std::string or std::vector<char> or even std::array<char> and not worry about terminators.
It depends on usage. If you want to use it not just as a byte array but as a C-string, probably with some standard string functions (strcmp and so on) or output to a stream, then your array should end with \0.
It depends on what you are trying to do. If you are trying to define a C-style string, then you need the terminator, since the C library won't be able to calculate the length of the string (and other things) if you don't...
In C++, though, the size of the string is already stored inside the std::string class along with the dynamic array of chars...
But if you just need a free container for storing characters where you don't need it to do C-string-like things... You are free to do:
char hello[128]; // 128 elements, do anything with them...
Without the terminator...
In your case, you are storing values, not creating a string, and you probably won't treat it as a string either, so doing it without the null terminator suffices...
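For illustration, a small sketch of both situations (the key characters here are made up for the example):

const char keys[4] = { 'w', 'a', 's', 'd' };   // used purely by index, no '\0' needed
const char word[] = "wasd";                    // a C-string: 5 bytes, including the trailing '\0'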
\0 will certainly make it easier when wanting to use functions like strlen, strcmp, strcat and the like, but is not required.
An aside - We have an entire enterprise code base built upon strings (char arrays) with no null terminators in the database. Works just fine.

c++14 - Is this a good way to prepend a char on a string?

If I wanted to add char c onto the beginning of string s, is the following good practice?
string s = "oo";
char c = 'f';
s = c + s;
In the Question "Prepend std::string" on SO the answers that suggested doing this were less well received than the top answer, which suggested using the member-function .insert().
Is there a reason besides efficiency (s = c + s is not efficient since all the contents of string s must be copied)?
Since both perform the same operation, what reason could there be besides efficiency? c+s will create a temporary string, thus requiring a copy of every character in both c and s, and potentially a heap allocation. The temporary will then be moved into the given object, which will have its current memory deallocated (if any). These are not cheap operations.
By contrast, insert will only perform a heap allocation if there is insufficient capacity for the new character. You'll still have the copying going on, since you're inserting at the beginning. But that's about it. It is as efficient as insertion at the head of a contiguous array gets.
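For completeness, the insert() alternative mentioned above looks like this, using the same s and c as the question:

string s = "oo";
char c = 'f';
s.insert(s.begin(), c);   // or equivalently: s.insert(0, 1, c);
// s is now "foo"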
The s = c + s operation creates a temporary object, probably dynamically allocating memory on the heap, performs the required concatenation, and then copies the result back into the string variable. More instructions and more memory operations are involved.
Memory operations like allocating and deallocating memory are costly.
insert would reallocate memory only if not enough contiguous memory is available for the string. In the worst case it would still match the s = c + s approach.
Although it is not much of a performance issue (considering the worst case), insert is also more elegant and easier to understand from a programmer's perspective.
Note also that there is nothing to stop an implementation of string from allowing limited appends at both ends without needing to move the contents. The default implementations do not do this, but some implementation might reserve extra space at the front of the string the first time you prepend, so that a subsequent prepend is "free". There are vector implementations that do this out there.

Difference between std::string(size, '\0') and s.resize(size)?

Unlike std::vector, std::string does not provide a unary constructor that takes a size:
std::string s(size); // ERROR
Is there any difference between:
std::string s(size, '\0');
and
std::string s;
s.resize(size);
in terms of their performance on common implementations?
Will resize initialize the string to all zero characters or will it leave them an unspecified value?
If all zero, is there any way to construct a string of a given size, but leave the characters with an unspecified value?
There is a difference: with std::string s(size, '\0'), all of the memory needed for the string can be allocated at once. With the second example, if size is greater than the number of characters that fit in the small-string-optimization buffer, an extra allocation may have to be performed; this is implementation-defined, and the second form will definitely not be more performant in that regard on a standard-compliant C++17 implementation. The first example is also more concise and may be more performant, so it is probably preferable. When calling s.resize(size), all new characters are value-initialized, which for char means '\0'. There is no standard way to construct a string of a given size while leaving the characters with unspecified values.
The actual answer would be implementation-based, but I'm fairly sure that std::string s(size, '\0'); is faster.
std::string s;
s.resize(size);
According to the documentation for std::string:
1) Default constructor. Constructs empty string (zero size and unspecified capacity).
The default constructor will create a string with an "unspecified capacity". My sense here is that the implementation is free to determine a default capacity, probably in the realm of 10-15 characters (pure speculation).
Then in the next line, you will reallocate the memory (resize) with the new size if the size is greater than the current capacity. This is probably not what you want!
If you really want to find out definitively, you can run a profiler on the two methods.
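If you do want to measure it, a minimal micro-benchmark sketch might look something like this (the iteration count, the size of 1000, and the use of std::chrono are arbitrary choices for the example; a real measurement should also guard more carefully against the optimizer removing the work):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>

int main()
{
    constexpr int iterations = 1000000;
    constexpr std::size_t size = 1000;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        std::string s(size, '\0');   // fill ctor: one allocation plus a fill
        s[0] = 'x';                  // touch the string so the loop isn't trivially removed
    }
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        std::string s;
        s.resize(size);              // default ctor, then resize
        s[0] = 'x';
    }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("fill ctor: %lld ms\nresize:    %lld ms\n",
                static_cast<long long>(std::chrono::duration_cast<ms>(t1 - t0).count()),
                static_cast<long long>(std::chrono::duration_cast<ms>(t2 - t1).count()));
}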
There is already a good answer from DeepCoder.
For the record, however, I'd like to point out that for strings (as for vectors) there are two distinct notions:
the size(): it's the number of actual (i.e. meaningful) characters in the string. You can change it using resize() (to which you can provide a second parameter to say what char you want to use as filler if it should be other than '\0')
the capacity(): it's the number of characters allocated to the string. It's at least the size but can be more. You can increase it with reserve()
If you're worried about allocation performance, I believe it's better to play with the capacity. The size should really be kept for real chars in the string not for padding chars.
By the way, more generally, s.resize(n) is the same as s.resize(n, char()). So if you'd like to fill it the same way at construction, you could consider string s(n, char()). But as long as you don't use basic_string<T> with a T other than a character type, your '\0' does the trick just as well.
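A small snippet summarizing the distinction, using the same size variable as the question:

std::string a(size, '\0');   // size() == size, every character is '\0'

std::string b;
b.resize(size);              // same result: resize(size) is resize(size, char())

std::string c;
c.reserve(size);             // capacity() >= size, size() stays 0, nothing is filled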
Resize does not leave elements uninitialized. According to the documentation: http://en.cppreference.com/w/cpp/string/basic_string/resize
s.resize(size) will value-initialize each appended character. That will cause each element of the resized string to be initialized to '\0'.
You would have to measure the performance difference of your specific C++ implementation to really decide if there's a worthwhile difference or not.
After looking at the machine code generated by Visual C++ for an optimized build, I can tell you the amount of code for either version is similar. What seems counter-intuitive is that the resize() version measured faster for me. Still, you should check your own compiler and standard library.