Create a std::array of char from a char pointer without copy - c++

In C++, is it possible to create a std::array of char (or a std::vector) from a char pointer without copying data?
I know I can write something like this, but the data is copied into the container, which is not optimal:
#define BUFFER_SIZE 256
...
char *buffer;
std::vector<char> data(buffer, buffer + BUFFER_SIZE); // std::array has no iterator-range constructor; std::vector does
I would like a solution to pass data by pointer, without copying data.

No, not really. Arrays and vectors own their data. Strings are the same.
As for "I would like a solution to pass data by pointer, without copying data":
Well, you can always pass by reference, but of course if a routine changes the data, the caller's data changes too.
For characters you can use a COW string (Moooooo!) or Copy On Write string. The idea here is that a copy of the string data itself is only made if you actually try to change the string. So it's kind of the best of both worlds. Unfortunately (or fortunately depending on your point of view) the standard library string isn't one. If I remember correctly some older STL strings used to be implemented like that. You can of course write your own in short order if you are reasonably C++ proficient, or you can likely find an implementation floating around somewhere. I have one I've used for like 30 years which works well enough for my purposes.
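To make the copy-on-write idea concrete, here is a minimal sketch. The class name CowString and its members are made up for illustration, it is not thread-safe, and it is not how any particular library implements it. Copies share one buffer until somebody writes.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal copy-on-write string. Not thread-safe and not a
// drop-in replacement for std::string.
class CowString {
public:
    explicit CowString(std::string text)
        : data_(std::make_shared<std::string>(std::move(text))) {}

    // Read access never copies.
    const std::string& str() const { return *data_; }

    // True while both objects still point at the same buffer.
    bool shares_buffer_with(const CowString& other) const {
        return data_ == other.data_;
    }

    // Write access: take a private copy first if the buffer is shared.
    void set_char(std::size_t index, char value) {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::string>(*data_);
        }
        data_->at(index) = value;
    }

private:
    std::shared_ptr<std::string> data_;
};
```

Copying a CowString only copies the shared_ptr; the character data is duplicated the first time set_char is called on an object whose buffer is shared.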

Related

CString or char array which one is better in terms of memory

I read somewhere that using CString is costly. Can you clarify that with an example? Also, between CString and a char array, which is better in terms of memory?
In addition to the array of chars (or wide chars), CString contains the string size, the allocated buffer size, and a reference counter (which additionally serves as a lock flag). The buffer containing the array of chars may be significantly larger than the string it contains, which reduces the number of time-costly allocation calls. In addition, even when the CString is set to be zero-sized, it still contains two wchar characters.
Naturally, when you compare the size of CString with the size of corresponding C-style array, the array will be smaller. However, if you want to manipulate your string as extensively as CString allows, you will eventually define your own variables for string size, buffer size and sometimes refcounter and/or guard flags. Indeed, you need to store your string size to avoid calling strlen each time you need it. You need to store separately your buffer size if you allow your buffer to be larger than the string length, and avoid calling reallocs each time you add to or subtract from the string. And so on -- you trade some small size increase for significant increases in speed, safety and functionality.
So, the answer depends on what you are going to do with the string. Suppose you want a string to store the name of your class for logging -- there a C-style string (const and static) will do fine. If you need a string to manipulate and use it extensively with MFC or ATL-related classes, use CString family types. If you need to manipulate string in the "engine" parts of your application that are isolated from its interface, and may be converted to other platforms, use std::string or write your own string type to suit your particular needs (this can be really useful when you write the "glue" code to place between the interface and the engine, otherwise std::string is preferable).
CString is from the MFC framework, which is specific to Windows. std::string is from the C++ standard library. Both are library classes for managing strings in memory. std::string gives you code portability across platforms.
Using a raw array is always good for memory; however, string operations become difficult with a raw array: consider bounds checking, getting the string length, copying the array, changing the size because the string may grow, deleting the array, etc. For all these problems, string utility classes are good wrappers. The string class will typically keep the actual string on the heap, and you have the overhead of the string class itself. However, it provides the functionality to manage the string memory, which you would otherwise have to write by hand.
Prefer std::string if you can, if not, use CString.
In almost all cases I encourage novice programmers to use std::string or CString(*). First, they will make significantly fewer errors. I have seen many buffer overruns, invalid memory accesses, and memory leaks caused by erroneous use of C arrays.
So which is more efficient, CString / std::string or raw character arrays? Memory-wise, generally speaking, all that CString and std::string add is a little bookkeeping such as an integer for the size. The question is, does it matter?
So which is more efficient in terms of performance? Well, it depends on what you are doing with it and how you are using your C arrays. But passing CString or std::string around can be computationally more efficient than passing C arrays. The problem with C arrays is that you can't be sure who owns the memory or what kind it is (heap/stack/literal). Defensive programming then results in more copies of arrays, just to be sure that the memory you hold stays valid for as long as it is needed.
Why is std::string or CString more efficient than C arrays if they are passed around by value? This is a bit more complicated, and the reasons differ between the two. For CString it is simple: it is implemented as a COW (copy-on-write) object. So when you have 5 objects that originate from one CString, they will not use more memory than one until you start to change one of them. std::string has stricter requirements and thus is not allowed to share memory with other std::string objects. But if you have a newer compiler (C++11 or later), std::string implements move semantics, and thus returning a string from a function will only result in a copy of the pointer, not a reallocation.
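A small sketch of the move-semantics point, assuming a C++11 or later compiler. The function names make_payload and take_ownership are invented for illustration.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: builds a string too long for any small-string buffer.
std::string make_payload() {
    std::string payload(1000, 'x');
    return payload;  // moved (or elided) on return, not deep-copied
}

// Takes over the caller's buffer instead of copying the characters.
std::string take_ownership(std::string&& source) {
    return std::string(std::move(source));
}
```

After take_ownership(std::move(p)), the string p is still a valid object but its contents are unspecified, so it should only be reassigned or destroyed.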
There are very few cases where raw C arrays are good and practical idea.
*) If you are already programming against MFC, why not just use CString.

Why does std::fstream use char*?

I'm writing a small program that reads the bytes from a binary file in groups of 16 bytes (please don't ask why), modifies them, and then writes them to another file.
The fstream::read function reads into a char * buffer, which I was initially passing to a function that looks like this:
char* modify (char block[16], std::string key)
The modification was done on block which was then returned. On roaming the posts of SO, I realized that it might be a better idea to use std::vector<char>. My immediate next worry was how to convert a char * to a std::vector<char>. Once again, SO gave me an answer.
But now what I'm wondering is: if it's such a good idea to use std::vector<char> instead of char*, why do the fstream functions use char* at all?
Also, is it a good idea to convert the char* from fstream to std::vector<char> in the first place?
EDIT: I now realize that since fstream::read is used to write data into objects directly, char * is necessary. I must now modify my question. Firstly, why are there no overloaded functions for fstream::read? And secondly, in the program that I've written about, which is a better option?
To use it with a vector, do not pass a pointer to the vector. Instead, pass a pointer to the vector content:
vector<char> v(size);
stream.read(&v[0], size);
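A slightly fuller sketch of the same idea, written against std::istream so it works for an std::ifstream opened in binary mode as well as for an in-memory stream. The helper name read_block is an assumption, not a standard function.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <vector>

// Hypothetical helper: reads up to `size` bytes from any input stream
// into a vector and shrinks the vector to the number of bytes actually read.
std::vector<char> read_block(std::istream& stream, std::size_t size) {
    std::vector<char> block(size);
    stream.read(block.data(), static_cast<std::streamsize>(size));
    block.resize(static_cast<std::size_t>(stream.gcount()));
    return block;
}
```

The read still goes through a plain char*; the vector only owns the buffer that pointer refers to.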
The fstream functions take char* so that you can point them at arbitrary pre-allocated buffers. std::vector<char> can be sized to provide an appropriate buffer, but it will be on the heap, and there are allocation costs involved with that. Sometimes, too, you may want to read or write data to a specific location in memory - even in shared memory - rather than accepting whatever heap memory the vector happens to have allocated. Further, you may want to use fstream without having included the vector header... it's nice to be able to avoid unnecessary includes as it reduces compilation time.
As your buffers are always 16 bytes in size, it's probably best to allocate them as char [16] data members in an appropriate owning object (if any exists), or on the stack (i.e. some function's local variable).
vector<> is more useful when the alternative is heap allocation - whether because the size is unknown at compile time, or is particularly large, or you want more flexible control of the memory lifetime. It's also useful when you specifically want some of the other vector functionality, such as ability to change the number of elements afterwards, to sort the bytes etc. - it seems very unlikely you'll want to do any of that so a vector raises questions in the mind of the person reading your code about what you'll do for no good purpose. Still, the choice of char[16] vs. vector appears (based on your stated requirements) more a matter of taste than objective benefit.
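For the fixed 16-byte case, a sketch with one stack buffer might look like this. The XOR in modify_block is only a stand-in for whatever the real modify() does, and both function names are assumptions.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>  // string streams, handy for trying this without files

// Hypothetical stand-in for the question's modify(): XOR each byte with a key byte.
void modify_block(char* block, std::size_t length, char key) {
    for (std::size_t i = 0; i < length; ++i) {
        block[i] = static_cast<char>(block[i] ^ key);
    }
}

// Copies `in` to `out` in 16-byte blocks, reusing a single stack buffer.
// A final short block is processed with however many bytes were read.
void transform_stream(std::istream& in, std::ostream& out, char key) {
    char block[16];
    while (in.read(block, sizeof block) || in.gcount() > 0) {
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        modify_block(block, got, key);
        out.write(block, static_cast<std::streamsize>(got));
    }
}
```

No heap allocation happens per block, and the same function works with std::ifstream and std::ofstream opened with std::ios::binary.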

Is there a way to pass ownership of an existing char* in heap to a std::string? [duplicate]

I have a situation where I need to process large (many GB's) amounts of data as such:
1. build a large string by appending many smaller (C char*) strings
2. trim the string
3. convert the string into a C++ const std::string for processing (read only)
4. repeat
The data in each iteration are independent.
My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.
Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?
Alternatively, could I use stringstreams or something similar to re-use a large buffer?
Edit: Thanks for the answers, for clarity, I think a revised question would be:
How can I build (via multiple appends) a stl C++ string efficiently? And if performing this action in a loop, where each iteration is totally independent, how can I re-use this allocated space?
You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.
A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
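As a rough illustration of the iterator-pair approach, here is a read-only processing function that does not care who owns the characters. The name count_char and the counting task itself are just placeholders for the real processing.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical read-only processing step: counts occurrences of `wanted`
// in any range of chars, regardless of how the memory was allocated.
template <typename Iterator>
std::size_t count_char(Iterator first, Iterator last, char wanted) {
    std::size_t count = 0;
    for (; first != last; ++first) {
        if (*first == wanted) {
            ++count;
        }
    }
    return count;
}
```

The same function accepts s.begin()/s.end() of a std::string, the iterators of a std::vector<char>, or a pair of raw pointers such as buffer and buffer + length, so the C buffer never has to be copied into a std::string.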
Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hits the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That way you never have to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments you have and how large they are. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
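A bare-bones sketch of such a fragment-walking iterator follows. The class FragmentIterator and the helpers fragments_begin/fragments_end are invented names, the fragments are assumed to live in a std::vector<std::string> that outlives the iterators, and this version only skips empty fragments (whitespace trimming would go in the same place).

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical forward-only iterator that walks the characters of a list of
// fragments as if they were one contiguous string. It does not own anything.
class FragmentIterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = const char&;

    FragmentIterator() = default;

    FragmentIterator(const std::vector<std::string>& fragments, std::size_t fragment)
        : fragments_(&fragments), fragment_(fragment) {
        skip_exhausted();
    }

    reference operator*() const { return (*fragments_)[fragment_][offset_]; }

    FragmentIterator& operator++() {
        ++offset_;
        skip_exhausted();
        return *this;
    }

    FragmentIterator operator++(int) {
        FragmentIterator previous = *this;
        ++*this;
        return previous;
    }

    bool operator==(const FragmentIterator& other) const {
        return fragment_ == other.fragment_ && offset_ == other.offset_;
    }
    bool operator!=(const FragmentIterator& other) const { return !(*this == other); }

private:
    // Moves on to the next fragment whenever the current one is used up.
    void skip_exhausted() {
        while (fragment_ < fragments_->size() &&
               offset_ >= (*fragments_)[fragment_].size()) {
            ++fragment_;
            offset_ = 0;
        }
    }

    const std::vector<std::string>* fragments_ = nullptr;
    std::size_t fragment_ = 0;
    std::size_t offset_ = 0;
};

FragmentIterator fragments_begin(const std::vector<std::string>& fragments) {
    return FragmentIterator(fragments, 0);
}

FragmentIterator fragments_end(const std::vector<std::string>& fragments) {
    return FragmentIterator(fragments, fragments.size());
}
```

Any processing code written against a begin/end pair can then consume the fragments directly, without the appended string ever being built.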
UPDATE (since I still see occasional upvotes on this answer):
C++17 introduces another choice: std::string_view, which has replaced std::string in many function signatures, is a non-owning reference to character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
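A short sketch of the std::string_view route, assuming a C++17 compiler. The helper names view_of and starts_with_digit are made up for the example.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: wraps an existing buffer without copying it.
// The caller must keep `buffer` alive for as long as the view is used.
std::string_view view_of(const char* buffer, std::size_t length) {
    return std::string_view(buffer, length);
}

// Read-only processing that accepts std::string, literals, or raw buffers.
bool starts_with_digit(std::string_view text) {
    return !text.empty() && text.front() >= '0' && text.front() <= '9';
}
```

The view stores only a pointer and a length, so view_of(buffer, n).data() is the original buffer address. The flip side is that the view dangles if the buffer is freed first.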
Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.
See this link for more information on the reserve function.
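The reserve-and-reuse idea could look roughly like this. The function name build_into is hypothetical. Note that the standard does not strictly promise that clear() keeps the capacity, although common implementations do.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-iteration step: rebuilds `scratch` from C-string fragments.
// clear() drops the contents; common implementations keep the capacity,
// so later iterations usually append without reallocating.
void build_into(std::string& scratch, const std::vector<const char*>& fragments) {
    scratch.clear();
    for (const char* fragment : fragments) {
        scratch.append(fragment);
    }
}
```

The caller would create one std::string outside the loop, call reserve once with a generous size, and pass the same object to build_into on every iteration.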
To help with really big strings, SGI has the class rope in its STL.
Non-standard, but may be useful.
http://www.sgi.com/tech/stl/Rope.html
Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)
This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...
Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:
class lightweight_string { };
Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.
Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be able to override it.
In this case, might it be better to process the char* directly, instead of assigning it to a std::string?

Access Violation Using memcpy or Assignment to an Array in a Struct

Update 2:
Well I’ve refactored the work-around that I have into a separate function. This way, while it’s still not ideal (especially since I have to free outside the function the memory that is allocated inside the function), it does afford the ability to use it a little more generally. I’m still hoping for a more optimal and elegant solution…
Update:
Okay, so the reason for the problem has been established, but I’m still at a loss for a solution.
I am trying to figure out an (easy/effective) way to modify a few bytes of an array in a struct. My current work-around of dynamically allocating a buffer of equal size, copying the array, making the changes to the buffer, using the buffer in place of the array, then releasing the buffer seems excessive and less-than optimal. If I have to do it this way, I may as well just put two arrays in the struct and initialize them both to the same data, making the changes in the second. My goal is to reduce both the memory footprint (store just the differences between the original and modified arrays), and the amount of manual work (automatically patch the array).
Original post:
I wrote a program last night that worked just fine but when I refactored it today to make it more extensible, I ended up with a problem.
The original version had a hard-coded array of bytes. After some processing, some bytes were written into the array and then some more processing was done.
To avoid hard-coding the pattern, I put the array in a structure so that I could add some related data and create an array of them. However now, I cannot write to the array in the structure. Here’s a pseudo-code example:
main() {
char pattern[]="\x32\x33\x12\x13\xba\xbb";
PrintData(pattern);
pattern[2]='\x65';
PrintData(pattern);
}
That one works but this one does not:
struct ENTRY {
char* pattern;
int somenum;
};
main() {
ENTRY Entries[] = {
{"\x32\x33\x12\x13\xba\xbb\x9a\xbc", 44}
, {"\x12\x34\x56\x78", 555}
};
PrintData(Entries[0].pattern);
Entries[0].pattern[2]='\x65'; //0xC0000005 exception!!! :(
PrintData(Entries[0].pattern);
}
The second version causes an access violation exception on the assignment. I’m sure it’s because the second version allocates memory differently, but I’m starting to get a headache trying to figure out what’s what or how to fix this. (I’m currently working around it by dynamically allocating a buffer of the same size as the pattern array, copying the pattern to the new buffer, making the changes to the buffer, using the buffer in place of the pattern array, and then trying to remember to free the temporary buffer.)
(Specifically, the original version cast the pattern array—+offset—to a DWORD* and assigned a DWORD constant to it to overwrite the four target bytes. The new version cannot do that since the length of the source is unknown—may not be four bytes—so it uses memcpy instead. I’ve checked and re-checked and have made sure that the pointers to memcpy are correct, but I still get an access violation. I use memcpy instead of str(n)cpy because I am using plain chars (as an array of bytes), not Unicode chars and ignoring the null-terminator. Using an assignment as above causes the same problem.)
Any ideas?
It is illegal to attempt to modify string literals. Your
Entries[0].pattern[2]='\x65';
line attempts exactly that. In your second example you are not allocating any memory for the strings. Instead, you are making your pointers (in the struct objects) to point directly at string literals. And string literals are not modifiable.
This question gets asked several times every day. Read Why is this string reversal C code causing a segmentation fault? for more details.
The problem boils down to the fact that a char[] is not a char*, even if the char[] acts a lot like a char* in expressions.
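The char[] versus char* difference can be shown in a few lines. The function name patch_local_copy is invented, and it returns the bytes only so the effect can be checked.

```cpp
#include <string>

// Returns the patched bytes so the effect can be inspected.
std::string patch_local_copy() {
    // The array is initialized with a copy of the literal's bytes, so it is writable.
    char pattern[] = "\x32\x33\x12\x13\xba\xbb";
    pattern[2] = '\x65';
    // By contrast, a pointer such as `const char* p = "...";` refers to the
    // literal itself; writing through such a pointer is undefined behavior.
    return std::string(pattern, sizeof pattern - 1);
}
```

In the struct version from the question, the char* member points straight at the literal, which is why the same assignment crashes there.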
Other answers have addressed the reason for the error: you're modifying a string literal which is not allowed.
This question is tagged C++ so the easy way to solve your problem is to use std::string.
struct ENTRY {
std::string pattern;
int somenum;
};
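With that struct, the table from the question can be initialized and patched directly. This is a sketch; make_patched_entries is a made-up helper name.

```cpp
#include <string>
#include <vector>

struct ENTRY {
    std::string pattern;  // owns a writable copy of the bytes
    int somenum;
};

// Hypothetical helper: builds the table from the question and patches one byte.
std::vector<ENTRY> make_patched_entries() {
    std::vector<ENTRY> entries = {
        {"\x32\x33\x12\x13\xba\xbb\x9a\xbc", 44},
        {"\x12\x34\x56\x78", 555},
    };
    entries[0].pattern[2] = '\x65';  // safe: modifies the std::string's own buffer
    return entries;
}
```

One caveat: constructing a std::string from a literal stops at the first zero byte, so a pattern that contains '\0' would need the (pointer, length) constructor instead.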
Based on your updates, your real problem is this: You want to know how to initialize the strings in your array of structs in such a way that they're editable. (The problem has nothing to do with what happens after the array of structs is created -- as you show with your example code, editing the strings is easy enough if they're initialized correctly.)
The following code sample shows how to do this:
// Allocate the memory for the strings, on the stack so they'll be editable, and
// initialize them:
char ptn1[] = "\x32\x33\x12\x13\xba\xbb\x9a\xbc";
char ptn2[] = "\x12\x34\x56\x78";
// Now, initialize the structs with their char* pointers pointing at the editable
// strings:
ENTRY Entries[] = {
{ptn1, 44}
, {ptn2, 555}
};
That should work fine. However, note that the memory for the strings is on the stack, and thus will go away if you leave the current scope. That's not a problem if Entries is on the stack too (as it is in this example), of course, since it will go away at the same time.
Some Q/A on this:
Q: Why can't we initialize the strings in the array-of-structs initialization? A: Because the strings themselves are not in the structs, and initializing the array only allocates the memory for the array itself, not for things it points to.
Q: Can we include the strings in the structs, then? A: No; the structs have to have a constant size, and the strings don't have constant size.
Q: This does save memory over having a string literal and then malloc'ing storage and copying the string literal into it, thus resulting in two copies of the string, right? A: Probably not. When you write
char pattern[] = "\x12\x34\x56\x78";
what happens is that the literal value gets embedded in your compiled code (just like a string literal, basically), and then when that line is executed, the memory is allocated on the stack and the value from the code is copied into that memory. So you end up with two copies regardless -- the non-editable version in the compiled program (which has to be there because it's where the initial value comes from), and the editable version elsewhere in memory. This is really mostly about what's simple in the source code, and a little bit about helping the compiler optimize the instructions it uses to do the copying.
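On the question of putting the strings inside the structs: if a fixed upper bound on the pattern length is acceptable, one alternative is a fixed-size char array member, which keeps the struct a constant size. The sketch below assumes no pattern is longer than 15 bytes; InlineEntry and patched_byte are invented names.

```cpp
// Hypothetical variant of ENTRY that stores the bytes inline.
// Assumes no pattern is longer than 15 bytes plus the terminator.
struct InlineEntry {
    char pattern[16];
    int somenum;
};

// Builds the table with the bytes copied into each struct, patches one byte,
// and returns it so the effect can be inspected.
char patched_byte() {
    InlineEntry entries[] = {
        {"\x32\x33\x12\x13\xba\xbb\x9a\xbc", 44},
        {"\x12\x34\x56\x78", 555},
    };
    entries[0].pattern[2] = '\x65';  // writable: the array lives inside the struct
    return entries[0].pattern[2];
}
```

The cost is that every entry takes the full 16 bytes, even when its pattern is shorter.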
