char*/string concatenation without copying? - c++

I would like to concatenate 2 strings in C or C++ without new memory allocation and copying. Is it possible?
Possible C code:
char* str1 = (char*)malloc(100);
char* str2 = (char*)malloc(50);
char* str3 = /* some code that concatenates these 2 strings
without copying to occupy a continuous memory region */
Then, when I don't need them any more, I just do:
free(str1);
free(str2);
Or, if possible, I would like to achieve the same in C++, using std::string or maybe char*, but with new and delete (possibly calling the sized void operator delete(void* ptr, std::size_t sz) overload from C++14 on str3).
There are a lot of questions about strings concatenation, but I haven't found one that asks the same.

No, it is not possible
In C, malloc operations return blocks of memory that have no relationship to each other. But in C, a string must be a contiguous array of bytes. So there is no way to extend str1 without copying, let alone concatenate.
For C++, perhaps ropes may be of interest: See this answer.
Ropes are allocated in chunks that do not have to be contiguous. This supports O(1) concatenation. However, the accessors make it appear as a single string of bytes. I'm certain that to convert ropes back to std::string or C style strings will take a copy however, but this is probably the closest to what you want.
Also, it is probably a premature optimization to worry about the cost of copying a few strings around. Unless you are moving lots of data, it won't matter.

Text concatenation is possible by writing your own string data structure. Easier in C++ than C.
struct My_String
{
std::vector<char *> text_fragments;
};
You would have to implement all the text manipulation and searching algorithms based on this data structure. Nothing in the C library could be applied to the My_String structure. The std::string in C++ would not be compatible.
One of the issues is how to handle text modification. If one of the text fragments is a constant literal (that can't be modified), it would need to be copied before it could be modified. But copying is against the requirements. :-(
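To make the fragment idea concrete, here is a minimal sketch, assuming C++17 and read-only access. FragmentString and its members are hypothetical names, not part of any library. It stores std::string_view pieces instead of raw char*, so no characters are copied; the vector itself still allocates a little memory for bookkeeping, and the caller must keep the original buffers alive.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical non-owning "concatenation": stores views, never copies characters.
struct FragmentString {
    std::vector<std::string_view> fragments;

    // Records another piece; the underlying buffer must outlive this object.
    void append(std::string_view piece) { fragments.push_back(piece); }

    // Total length of the logical string.
    std::size_t size() const {
        std::size_t total = 0;
        for (std::string_view f : fragments) total += f.size();
        return total;
    }

    // Returns the i-th character of the logical string, or '\0' if out of range.
    char at(std::size_t i) const {
        for (std::string_view f : fragments) {
            if (i < f.size()) return f[i];
            i -= f.size();  // skip this fragment and keep looking
        }
        return '\0';
    }
};
```

Lookup by index is linear in the number of fragments here; a real rope would use a tree to make that logarithmic.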

A "string" in C is an array of chars with a null char at the end. And an array is "a data structure that lets you store one or more elements consecutively in memory". GNU C reference
You cannot concatenate two arrays that are not in consecutive memory blocks without copying one of them. You can do it however without allocating new memory. E.g.
char* str1 = malloc(100); // size 100 bytes, uninitialised
str1[0] = '\0'; // string length 0, size of str1 100
strcat(str1, "a"); // string length 1, size of str1 still 100
strcat(str1, "b"); // string length 2, size of str1 still 100
You could if you want retrieve chars of 2 strings as if they were one without copying or reallocating. Here is an example function to do that (simple example, don't use in production code)
char* str1 = (char*)malloc(100);
char* str2 = (char*)malloc(50);
char get_char(int i) {
if (i >= 0 && i < 100) {
return str1[i];
}
if (i >= 100 && i < 150) {
return str2[i-100];
}
return 0;
}
But in such a case you couldn't have a char* str3 to perform pointer arithmetic with and access all 150 chars.

The C and C++ tags are contradictory. In C, I'd recommend exploring realloc. You can code something along the following lines:
char* str = malloc(50);
str = realloc(str, 55);
If you are lucky, the realloc call will not allocate a new block and will just extend the already allocated one, but there is no guarantee of this. This way you at least have a shot at avoiding a reallocation of the string. You will still have to copy the contents of the second string into the newly allocated memory.
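A slightly more defensive version of the same idea might look like the sketch below. It is written so that it also compiles as C++ (hence the cast), and append_realloc is a hypothetical helper name. It assumes *dest was obtained from malloc or realloc and holds a valid NUL-terminated string.

```cpp
#include <cstdlib>
#include <cstring>

// Appends src to the malloc-allocated string *dest, growing it with realloc.
// Returns true on success; on failure *dest is left untouched and still valid.
bool append_realloc(char** dest, const char* src) {
    const std::size_t n1 = std::strlen(*dest);
    const std::size_t n2 = std::strlen(src);
    // Keep the result in a temporary: assigning straight to *dest would
    // leak the original block if realloc returned a null pointer.
    char* grown = static_cast<char*>(std::realloc(*dest, n1 + n2 + 1));
    if (grown == nullptr) return false;
    std::memcpy(grown + n1, src, n2 + 1);  // copies src plus its terminator
    *dest = grown;
    return true;
}
```

Whether the first string gets moved is up to the allocator; only the second string is copied by this code itself.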

Related

Memory efficiency of C++ arrays

Somewhere in my brainstem a voice whispers:
In C++, an array does not need more memory than the number of elements
need.
std::string str = "aabbcc";
std::array<std::string, 3> str_array = {"aa", "bb", "cc"};
Accordingly, both should have the same size, because (unlike in Java), there is no separate size field or similar. But I haven't found a reference.
Is this true? Under which circumstances is it not?
Storing strings in any language is more complicated than you think. A C++ std::string must provide you contiguous storage for the contents. Apart from that, std::string can hold more things, like a pointer/iterator to the last character, the number of characters in it, etc. std::string::size is required to be O(1), so it must store more information than just a buffer. Also, most standard library implementations provide SSO (small string optimization). When SSO is enabled, std::string keeps a small buffer inside the object itself, to avoid unnecessary dynamic allocations. You can also reserve more memory than you need. Let's say you need to collect 800-1000 characters in a loop. You can do it like this:
std::string str;
for(...)
str += some_character;
But this will cause unnecessary memory allocations and deallocations. If you can estimate the number of characters you want to store, you should reserve memory.
std::string str;
str.reserve(1000);
for(...)
str.push_back(some_character);
Then, you can always shrink_to_fit, to save memory:
str.shrink_to_fit();
There are also other things you must be aware of:
reserve increases capacity, but size stays the same. It means, that std::string must also store (or be able to calculate) for how many more characters buffer capacity allows.
string literals are null terminated
std::basic_string::c_str must return null terminated array of characters, so it is possible that std::string also contains null terminator (unluckily I am not sure how it is done)
there are more encodings and characters sets - ASCII is just one of them. UTF-8 and UTF-16 encoded strings may need to use few stored elements to add up to one code point, but this is more complicated.
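The size/capacity distinction can be observed directly. The following is a rough sketch (ReserveReport and fill_with_reserve are made-up names) that records what reserve does and whether the character buffer moves while appending. On the mainstream standard library implementations, the buffer should stay put as long as the size does not exceed the reserved capacity.

```cpp
#include <cstddef>
#include <string>

struct ReserveReport {
    std::size_t size_after_reserve;      // expected to be 0
    std::size_t capacity_after_reserve;  // expected to be at least n
    bool buffer_moved;  // true if the character buffer was reallocated while appending
};

// Reserves room for n characters, appends n characters, and reports what happened.
ReserveReport fill_with_reserve(std::size_t n) {
    std::string s;
    s.reserve(n);  // grows capacity only; size stays 0
    ReserveReport report{s.size(), s.capacity(), false};
    const char* before = s.data();
    for (std::size_t i = 0; i < n; ++i) {
        s.push_back('x');
        if (s.data() != before) report.buffer_moved = true;
    }
    return report;
}
```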
In C++, an array does not need more memory than the number of elements need.
This is true. A raw array has a size equal to the size of its element type times the number of elements. So,
int array[10];
has a size of sizeof(int) * std::size(array). std::array is the same but it is allowed to have padding so
std::array<int, 10> array;
has the size of sizeof(int) * std::size(array) + P where P is some integer amount of padding.
Your example though isn't quite the same thing.
A std::string is a container. It has its own size that is separate from what it contains. So sizeof(std::string) will always be the same regardless of how many characters are in the string. So, ignoring short string optimization,
std::string str = "aabbcc";
takes up sizeof(std::string) plus however much the string allocated for the underlying C-string. That is not the same value as
std::array<std::string, 3> str_array = {"aa", "bb", "cc"};
Since you now have 3 * sizeof(std::string) plus whatever each string allocated.
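A quick way to check this on your own compiler is sketched below; the two helper names are invented for illustration. It only measures the objects themselves, not any heap buffers the strings may own, and the actual numbers are implementation-dependent.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Bytes occupied by the string object itself, regardless of its contents.
std::size_t object_bytes(const std::string& s) { return sizeof(s); }

// Bytes occupied by an array of three string objects (excluding any heap buffers).
std::size_t array_object_bytes(const std::array<std::string, 3>& a) { return sizeof(a); }
```

Calling object_bytes on a 6-character string and on a 1000-character string gives the same value, while the array is at least three times that.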
Accordingly, both should have the same size (e.g. 6 bytes),
Not a correct deduction.
The memory used by a std::string, if you want to call its size, consists of at least a pointer and the memory allocated to hold the data.
The memory allocated to hold the data can also include the space required to hold the terminating null character.
Given
std::string s = "aabbcc";
std::string a = "aa";
std::string b = "bb";
std::string c = "cc";
mem(s) != mem(a) + mem(b) + mem(c)
Virtually every string can hold following info:
The size of the string i.e. num of chars it contains.
The capacity of memory holding the string's chars.
The value of the string.
Additionally it may also hold:
A copy of its allocator and a reference count for the value.
They don’t have the same size. Strings are saved null-terminated, giving you an extra byte for each string.

How to code a strcat function that works with two dynamic arrays

As we know, the strcat function concatenates one C-string onto another to make one big C-string containing both.
My question is how to make a strcat function that works with two dynamically allocated arrays.
The desired strcat function should be able to work for any sized myStr1 and myStr2
//dynamic c-string array 1
char* myStr1 = new char [26];
strcpy(myStr1, "The dog on the farm goes ");
//dynamic c-string array 2
char* myStr2 = new char [6];
strcpy(myStr2, "bark.");
//desired function
strcat(myStr1,myStr2);
cout<<myStr1; //would output 'The dog on the farm goes bark.'
This is as far as I was able to get on my own:
//*& indicates that the dynamic c-string str1 is passed by reference
void strcat(char*& str1, char* str2)
{
int size1 = strlen(str1);
int size2 = strlen(str2);
//unknown code
//str1 = new char [size1+size2]; //Would wipe out str1's original contents
}
Thanks!
You need first to understand better how pointers work. Your code for example:
char* myStr1 = new char [25];
myStr1 = "The dog on the farm goes ";
first allocates 25 characters, then ignores the pointer to that allocated area (the technical term is "leaks it") and sets myStr1 to point to a string literal.
That code should have used strcpy instead to copy from the string literal into the allocated area. Except that the string is 25 characters so you will need to allocate space for at least 26 as one is needed for the ASCII NUL terminator (0x00).
Correct code for that part should have been:
char* myStr1 = new char [26]; // One more than the actual string length
strcpy(myStr1, "The dog on the farm goes ");
To do the concatenation of C strings the algorithm could be:
measure the lengths n1 and n2 of the two strings (with strlen)
allocate n1+n2+1 characters for the destination buffer (+1 is needed for the C string terminator)
strcpy the first string at the start of the buffer
strcat the second string to the buffer (*)
delete[] the memory for the original string buffers if they are not needed (if this is the right thing to do or not depends on who is the "owner" of the strings... this part is tricky as the C string interface doesn't specify that).
(*) This is not the most efficient way. strcat will go through all the characters of the string to find where it ends, but you already know that the first string length is n1 and the concatenation could be done instead with strcpy too by choosing the correct start as buffer+n1. Even better instead of strcpy you could use memcpy everywhere if you know the count as strcpy will have to check each character for being the NUL terminator. Before getting into this kind of optimization however you should understand clearly how things work... only once the string concatenation code is correct and for you totally obvious you are authorized to even start thinking about optimization.
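The steps above, using the memcpy variant from the footnote, could be sketched roughly as follows. concat_new is a hypothetical name, and the ownership rule (the caller releases the result with delete[]) is an assumption of this sketch, since the C string interface does not specify one.

```cpp
#include <cstddef>
#include <cstring>

// Returns a new[]-allocated buffer holding a followed by b.
// The caller owns the result and must release it with delete[].
char* concat_new(const char* a, const char* b) {
    const std::size_t n1 = std::strlen(a);
    const std::size_t n2 = std::strlen(b);
    char* out = new char[n1 + n2 + 1];  // +1 for the terminator
    std::memcpy(out, a, n1);            // first string, without its terminator
    std::memcpy(out + n1, b, n2 + 1);   // second string, including its terminator
    return out;
}
```

The inputs are left untouched, so deleting them (or not) stays the caller's decision.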
PS: Once you get all this correct and working and efficient you will appreciate how much of a simplification is to use std::string objects instead, where all this convoluted code becomes just s1+s2.
You allocate memory and make your pointers point to that memory. Then you overwrite the pointers, making them point somewhere else. The assignment of e.g. myStr1 causes the variable to point to the string literal instead of the memory you allocated. You need to copy the strings into the memory you have allocated.
Of course, that copying will lead to another problem, as you seem to forget that C-strings need an extra character for the terminator. So a C-string with 5 characters needs space for six characters.
As for your concatenation function, you need to do copying here too. Allocate enough space for both strings plus a single terminator character. Then copy the first string into the beginning of the new memory, and copy the second string into the end.
Also you need a temporary pointer variable for the memory you allocate, as you otherwise "would wipe out str1's original contents" (not strictly true, you just make str1 point somewhere else, losing the original pointer).
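Putting the temporary-pointer advice into the shape of the function from the question, one possible sketch is shown below. The name append_in_place is made up to avoid clashing with the standard strcat, and the code assumes str1 was allocated with new char[] and is owned by the caller.

```cpp
#include <cstddef>
#include <cstring>

// Replaces str1 with a larger new[] buffer that holds str1 followed by str2.
// Assumes str1 was allocated with new char[]; the old buffer is released here.
void append_in_place(char*& str1, const char* str2) {
    const std::size_t n1 = std::strlen(str1);
    const std::size_t n2 = std::strlen(str2);
    char* tmp = new char[n1 + n2 + 1];  // temporary keeps str1 readable while copying
    std::memcpy(tmp, str1, n1);
    std::memcpy(tmp + n1, str2, n2 + 1);  // includes the terminator
    delete[] str1;  // only now is it safe to drop the original buffer
    str1 = tmp;
}
```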

Assigning strings of any size to a pointer to char

Before all, I must state that I'm a beginner with C++ and programming overall.
I'll get straight to the point. I'm wondering if it's possible to assign a string of characters of any size to a pointer to a character (not arrays, just a char * pointer). Would that violate any Memory Addresses?
The book I'm learning from doesn't seem to say anything about that. I can't seem to find anything on Google either.
You have your character pointer and want to dynamically create C strings
char *str;
say. This pointer will be used to point to the first character of the string. The string is a series of sequential characters (bytes) in memory. What we want to achieve in memory is this:
str -> +---+---+---+---+---+----+
| H | E | L | L | O | \0 |
+---+---+---+---+---+----+
Note the final byte. This byte has the value 0 and is called the null character; it represents the end of the string and enables one to easily know when we have come to the end.
To give str a value, we allocate this memory. In C++ this is done with the new operator, like this:
str = new char[6];
Note new has two versions new[] and new - one is to allocate an array of object, the other is to allocate a single object. ALWAYS use delete[] when you have allocated it with new[], similarly new/delete should be used. DO NOT MIX new[] with delete, and new with delete[]
This will allocate an array of 6 characters to place the string into. To place the characters into the string we could do this:
str[0] = 'H';
str[1] = 'E';
...
str[5] = 0;
But this would be tedious. Instead we can use strcpy to do this for us:
strcpy(str, "hello");
It knows all about the null character. There is a range of functions that operate on these types of strings - please see string
These are C strings. Once upon a time somebody invented this new language called C++. This language uses a different idea called objects that makes this stuff a lot easier. You need to look at the standard template library (or STL). Notes on these strings can be found at string. There are lots of goodies in the STL - here is a reference STL
Hope this helps
A char pointer can point to a string of any length, because the length of the string is determined by when you run into a NUL (0) byte in the string. When you store strings this way, it becomes a C-string. For instance:
const char* str = NULL; // at this point,
// doesn't point to anything (not even a string)
str = ""; // valid
str = "a"; // valid
str = "hello"; // valid
str = "farewell, cruel world"; // valid
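To see why the pointer type does not care about the length, here is a small sketch of how the length is actually discovered: by walking from the pointed-to character until the NUL byte. count_chars is a hypothetical stand-in for what strlen does.

```cpp
#include <cstddef>

// Counts characters starting at str until the terminating NUL byte is found.
// str must point to a valid NUL-terminated string.
std::size_t count_chars(const char* str) {
    std::size_t n = 0;
    while (str[n] != '\0') ++n;
    return n;
}
```

The same pointer variable can be passed here whether it currently points at "" or at "farewell, cruel world"; only the memory it points to differs.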

connecting chars*

How would I connect two char* strings to each other?
For example:
char* a="Heli";
char* b="copter";
How would I connect them into one char* c which should be equal to "Helicopter"?
strncat
Or use strings.
size_t newlen = strlen(a) + strlen(b);
char *r = malloc(newlen + 1);
strcpy(r, a);
strcat(r, b);
In C++:
std::string foo(a);
std::string bar(b);
std::string result = foo+bar;
If your system has asprintf() (pretty common these days), then it's easy:
char* p;
int num_chars = asprintf(&p, "%s%s", a, b);
The second argument is a format string akin to printf(), so you can mix in constant text, ints, doubles etc., controlling field widths and precision, padding characters, justification etc. If num_chars != -1 (an error), then p points to heap-allocated memory that can be released with free(). Using asprintf() avoids the relatively verbose and error-prone steps of calculating the required buffer size yourself.
In C++:
std::string result = std::string(a) + b;
Note: a + b adds two pointers - not what you want, hence at least one side of the + operator needs to see a std::string, which will ensure the string-specific concatenation operator is used.
(The accepted answer of strncat is worth further comment: it can be used to concatenate more textual data after an ASCIIZ string in an existing, writeable buffer, in-so-much as that buffer has space to spare. You can't safely/portably concatenate onto a string literal, and it's still a pain to create such a buffer. If you do it using malloc() to ensure it's exactly the right length, then strcat() can be used in preference to strncat() anyway.)
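Where asprintf() is not available, a similar effect can be had with standard snprintf by asking it for the required length first. The sketch below assumes C++11 or later; join_formatted is a made-up helper name.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Formats a and b back to back, sizing the buffer from snprintf's return value.
std::string join_formatted(const char* a, const char* b) {
    // With a null buffer and size 0, snprintf only reports the length needed.
    const int needed = std::snprintf(nullptr, 0, "%s%s", a, b);
    if (needed < 0) return std::string();  // formatting error
    std::vector<char> buf(static_cast<std::size_t>(needed) + 1);  // +1 for the terminator
    std::snprintf(buf.data(), buf.size(), "%s%s", a, b);
    return std::string(buf.data(), static_cast<std::size_t>(needed));
}
```

For plain concatenation std::string(a) + b is simpler; the two-pass form pays off once the format string mixes in numbers or padding.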

Difference between string and char[] types in C++

For C, we use char[] to represent strings.
For C++, I see examples using both std::string and char arrays.
#include <iostream>
#include <string>
using namespace std;
int main () {
string name;
cout << "What's your name? ";
getline(cin, name);
cout << "Hello " << name << ".\n";
return 0;
}
#include <iostream>
using namespace std;
int main () {
char name[256];
cout << "What's your name? ";
cin.getline(name, 256);
cout << "Hello " << name << ".\n";
return 0;
}
(Both examples adapted from http://www.cplusplus.com.)
What is the difference between these two types in C++? (In terms of performance, API integration, pros/cons, ...)
A char array is just that - an array of characters:
If allocated on the stack (like in your example), it will always occupy eg. 256 bytes no matter how long the text it contains is
If allocated on the heap (using malloc() or new char[]) you're responsible for releasing the memory afterwards and you will always have the overhead of a heap allocation.
If you copy a text of more than 256 chars into the array, it might crash, produce ugly assertion messages or cause unexplainable (mis-)behavior somewhere else in your program.
To determine the text's length, the array has to be scanned, character by character, for a \0 character.
A string is a class that contains a char array, but automatically manages it for you. Most string implementations have a built-in array of 16 characters (so short strings don't fragment the heap) and use the heap for longer strings.
You can access a string's char array like this:
std::string myString = "Hello World";
const char *myStringChars = myString.c_str();
C++ strings can contain embedded \0 characters, know their length without counting, are faster than heap-allocated char arrays for short texts and protect you from buffer overruns. Plus they're more readable and easier to use.
However, C++ strings are not (very) suitable for usage across DLL boundaries, because this would require any user of such a DLL function to make sure he's using the exact same compiler and C++ runtime implementation, lest he risk his string class behaving differently.
Normally, a string class would also release its heap memory on the calling heap, so it will only be able to free memory again if you're using a shared (.dll or .so) version of the runtime.
In short: use C++ strings in all your internal functions and methods. If you ever write a .dll or .so, use C strings in your public (dll/so-exposed) functions.
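As an illustration of that split, a library boundary might look something like the sketch below. Both function names are hypothetical, and the "return the required length, write only if the buffer is large enough" convention is just one common choice, not the only one.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Internal code: free to use std::string.
static std::string make_greeting(const std::string& name) {
    return "Hello " + name + ".";
}

// Hypothetical exported function: plain C types only, and the caller owns the buffer.
// Returns the number of characters needed (excluding the terminator); the text is
// written only if buffer is non-null and large enough to hold it plus the terminator.
extern "C" std::size_t greeting_for(const char* name, char* buffer, std::size_t buffer_size) {
    const std::string text = make_greeting(name);
    if (buffer != nullptr && buffer_size > text.size()) {
        std::memcpy(buffer, text.c_str(), text.size() + 1);
    }
    return text.size();
}
```

Because no std::string object and no heap pointer crosses the boundary, the caller and the library do not need to share a runtime or allocator.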
Arkaitz is correct that string is a managed type. What this means for you is that you never have to worry about how long the string is, nor do you have to worry about freeing or reallocating the memory of the string.
On the other hand, the char[] notation in the case above has restricted the character buffer to exactly 256 characters. If you tried to write more than 256 characters into that buffer, at best you will overwrite other memory that your program "owns". At worst, you will try to overwrite memory that you do not own, and your OS will kill your program on the spot.
Bottom line? Strings are a lot more programmer friendly, char[]s are a lot more efficient for the computer.
Well, string type is a completely managed class for character strings, while char[] is still what it was in C, a byte array representing a character string for you.
In terms of API and standard library everything is implemented in terms of strings and not char[], but there are still lots of functions from the libc that receive char[] so you may need to use it for those, apart from that I would always use std::string.
In terms of efficiency of course a raw buffer of unmanaged memory will almost always be faster for lots of things, but take in account comparing strings for example, std::string has always the size to check it first, while with char[] you need to compare character by character.
I personally do not see any reason why one would like to use char* or char[] except for compatibility with old code. std::string is no slower than using a c-string, except that it will handle re-allocation for you. You can set its size when you create it, and thus avoid re-allocation if you want. Its indexing operator ([]) provides constant time access (and is in every sense of the word the exact same thing as using a c-string indexer). Using the at method gives you bounds-checked safety as well, something you don't get with c-strings unless you write it yourself. Your compiler will most often optimize out the indexer use in release mode. It is easy to make mistakes with c-strings: delete vs delete[], exception safety, even how to reallocate a c-string.
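The bounds checking mentioned above can be seen in a few lines. This is only a sketch, and at_rejects is an invented helper; std::string::at throws std::out_of_range when the index is not less than size().

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns true if s.at(i) refuses the index by throwing std::out_of_range.
bool at_rejects(const std::string& s, std::size_t i) {
    try {
        (void)s.at(i);  // checked access
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

With a char array, the equivalent out-of-range read would simply be undefined behaviour with no diagnostic.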
And when you have to deal with advanced concepts like having COW strings, and non-COW for MT etc, you will need std::string.
If you are worried about copies, as long as you use references, and const references wherever you can, you will not have any overhead due to copies, and it's the same thing as you would be doing with the c-string.
One of the differences is null termination (\0).
In C and C++, char* or char[] will take a pointer to a single char as a parameter and will track along the memory until a 0 memory value is reached (often called the null terminator).
C++ strings can contain embedded \0 characters, know their length without counting.
#include<stdio.h>
#include<string.h>
#include<iostream>
using namespace std;
void NullTerminatedString(string str){
int NUll_term = 3;
str[NUll_term] = '\0'; // specific character is kept as NULL in string
cout << str << endl <<endl <<endl;
}
void NullTerminatedChar(char *str){
int NUll_term = 3;
str[NUll_term] = 0; // from specific, all the character are removed
cout << str << endl;
}
int main(){
string str = "Feels Happy";
printf("string = %s\n", str.c_str());
printf("strlen = %zu\n", strlen(str.c_str()));
printf("size = %zu\n", str.size());
printf("sizeof = %zu\n", sizeof(str)); // sizeof std::string class and compiler dependent
NullTerminatedString(str);
char str1[12] = "Feels Happy";
printf("char[] = %s\n", str1);
printf("strlen = %zu\n", strlen(str1));
printf("sizeof = %zu\n", sizeof(str1)); // sizeof char array
NullTerminatedChar(str1);
return 0;
}
Output:
strlen = 11
size = 11
sizeof = 32
Fee s Happy
strlen = 11
sizeof = 12
Fee
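The same difference can be shown without relying on how a terminal renders the embedded \0. In this sketch (Lengths and truncate_at are illustrative names), the std::string keeps its full size after a NUL is written into it, while strlen on the same data stops at that NUL. pos is assumed to be less than s.size().

```cpp
#include <cstddef>
#include <cstring>
#include <string>

struct Lengths {
    std::size_t size;      // what std::string reports
    std::size_t c_strlen;  // what strlen sees through c_str()
};

// Writes a NUL at position pos of a copy of s and reports both length notions.
Lengths truncate_at(std::string s, std::size_t pos) {
    s[pos] = '\0';  // embedded NUL: still an ordinary character to std::string
    return {s.size(), std::strlen(s.c_str())};
}
```

For "Feels Happy" and pos 3, size stays 11 while strlen drops to 3.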
Think of (char *) as string.begin(). The essential difference is that (char *) is an iterator and std::string is a container. If you stick to basic strings a (char *) will give you what std::string::iterator does. You could use (char *) when you want the benefit of an iterator and also compatibility with C, but that's the exception and not the rule. As always, be careful of iterator invalidation. When people say (char *) isn't safe this is what they mean. It's as safe as any other C++ iterator.
Strings have helper functions and manage char arrays automatically. You can concatenate strings, for a char array you would need to copy it to a new array, strings can change their length at runtime. A char array is harder to manage than a string and certain functions may only accept a string as input, requiring you to convert the array to a string. It's better to use strings, they were made so that you don't have to use arrays. If arrays were objectively better we wouldn't have strings.