Why does std::fstream use char*? - c++

I'm writing a small program that reads a binary file in groups of 16 bytes (please don't ask why), modifies them, and then writes them to another file.
The fstream::read function reads into a char * buffer, which I was initially passing to a function that looks like this:
char* modify (char block[16], std::string key)
The modification was done on block which was then returned. On roaming the posts of SO, I realized that it might be a better idea to use std::vector<char>. My immediate next worry was how to convert a char * to a std::vector<char>. Once again, SO gave me an answer.
But now what I'm wondering is: If it's such a good idea to use std::vector<char> instead of char*, why do the fstream functions use char* at all?
Also, is it a good idea to convert the char* from fstream to std::vector<char> in the first place?
EDIT: I now realize that since fstream::read is used to write data into objects directly, char * is necessary. I must now modify my question. Firstly, why are there no overloaded functions for fstream::read? And secondly, in the program that I've written about, which is a better option?

To use it with a vector, do not pass a pointer to the vector. Instead, pass a pointer to the vector content:
vector<char> v(size);
stream.read(&v[0], size);

fstream functions take char* so that you can point them at arbitrary pre-allocated buffers. std::vector<char> can be sized to provide an appropriate buffer, but it will be on the heap and there are allocation costs involved with that. Sometimes, too, you may want to read or write data at a specific location in memory, even in shared memory, rather than accepting whatever heap memory the vector happens to have allocated. Further, you may want to use fstream without having included the vector header; it's nice to be able to avoid unnecessary includes, as it reduces compilation time.
As your buffers are always 16 bytes in size, it's probably best to allocate them as char [16] data members in an appropriate owning object (if any exists), or on the stack (i.e. some function's local variable).
vector<> is more useful when the alternative is heap allocation, whether because the size is unknown at compile time, or is particularly large, or you want more flexible control of the memory lifetime. It's also useful when you specifically want some of the other vector functionality, such as the ability to change the number of elements afterwards, to sort the bytes, etc. It seems very unlikely you'll want to do any of that, so a vector would make the person reading your code wonder, for no good purpose, what you intend to do with it. Still, the choice of char[16] vs. vector appears (based on your stated requirements) more a matter of taste than objective benefit.


char* buffer = new vs char buffer[] in C++

1. char* buffer = new char[size]
2. char buffer[size]
I'm new to C++ and I see most places creating buffers using the first example. I know in the first method, the data in that part of memory can be passed on until manually deleted using delete[]. While using the second method, the buffer would have a lifetime depending on the scope. If I only plan on the buffer lasting through a particular function and I don't plan on passing it to anything else, does it matter which method I use?
char* buffer = new char[size]
This is portable, but should be avoided. Until you really know what you're doing, using new directly is almost always a mistake (and when you do know what you're doing, it's still a mistake, but you'll know that without being told).
char buffer[size]
This depends on how you've defined size. If it's a constant (and fairly small), then this is all right. If it's not a constant, then any properly functioning compiler is required to reject it (but some common ones accept it anyway).
If it's constant, but "large", the compiler will accept the code, but it's likely to fail when you try to execute it. In this case, anything over a million bytes is normally too large, and anything more than a few hundred thousand or so becomes suspect.
There is one exception to that though: if this is defined outside any function (i.e., as a global variable), then it can safely be much larger than a local variable can be. At the same time, I feel obliged to point out that I consider global variables something that should normally be avoided as a rule (and I'm far from being alone in holding that opinion).
Also note that these two are (more or less) mutually exclusive: if size is a constant, you generally want to avoid dynamic allocation, but it has to be a constant to just define an array (again, with a properly functioning compiler).
Unless size is a fairly small constant, most of the time you should avoid both of these. What you most likely want is either:
std::string buffer;
std::vector<char> buffer(size);
or possibly:
std::array<char, size> buffer;
The first two of these can allocate space for the buffer dynamically, but generally keep the allocation "hidden", so you don't normally need to deal with it directly. The std::array is pretty much like the char buffer[size] (e.g., it has a fixed size, and is really only suitable for fairly small sizes), but it enforces that the size has to be a constant, and gives you roughly the same interface as vector (minus anything that would change the number of elements, since that's a constant with std::array).
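A minimal sketch of the two situations, assuming a run-time size for the vector and an illustrative compile-time constant (kSmallSize, a made-up name) for the array:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Size known only at run time: std::vector owns a heap buffer and frees it automatically.
std::vector<char> make_runtime_buffer(std::size_t size) {
    return std::vector<char>(size);  // `size` chars, all zero
}

// Size fixed at compile time: std::array lives wherever it is declared (here, the stack).
constexpr std::size_t kSmallSize = 64;  // illustrative value

std::size_t count_zero_bytes() {
    std::array<char, kSmallSize> buffer{};  // value-initialized to zeros
    std::size_t zeros = 0;
    for (char c : buffer) {
        if (c == 0) ++zeros;
    }
    return zeros;
}
```

Neither function needs a delete[]: the vector releases its memory in its destructor, and the array disappears with its enclosing scope.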
The main difference is that the first variant is dynamic allocation and the second one is not. You require dynamic allocation when you do not know at compile time how much memory you will need, that is, when "size" is not a constant but is calculated at runtime depending on external input.
It is good practice¹ to use containers that handle dynamic memory internally, so that you do not have to delete manually, which is often a source of bugs and memory leaks.
A common, dynamic container for all kinds of data is
std::vector<char> (don't forget to #include <vector> )
However, if you handle text, use the class std::string, which also handles the memory internally. Raw char* arrays are a remnant of old C.
¹ That good practice has one main exception: when you store not primitive data types but your own classes holding massive amounts of data. The reason is that std::vector<> performs copy operations when resized (and those are more expensive the larger the data).
However once you have come that far in your C++ projects, you should know about "smart pointers" by then which are the safe solution for those special cases.
By the way, with &the_vector[0] (address of the first element in the vector) you can get a pointer that behaves pretty much like the char array and thus can be used for older functions that do not accept vectors directly.
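For instance, a vector can be handed to a function that only understands raw pointers. In this sketch, legacy_fill is a hypothetical stand-in for such an older API; v.data() (C++11) yields the same pointer as &v[0] but is also safe to call on an empty vector.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical legacy function that only knows about raw pointers.
void legacy_fill(char* dest, std::size_t len, char value) {
    std::memset(dest, value, len);
}

std::vector<char> make_filled(std::size_t len, char value) {
    std::vector<char> v(len);
    if (!v.empty()) {
        // v.data() is equivalent to &v[0] here; the elements are contiguous.
        legacy_fill(v.data(), v.size(), value);
    }
    return v;
}
```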

CString or char array which one is better in terms of memory

I read somewhere that usage of CString is costly. Can you clarify it with an example? Also, between CString and a char array, which is better in terms of memory?
CString, in addition to the array of chars (or wide chars), contains the string size, the allocated buffer size, and a reference counter (serving additionally as a lock flag). The buffer containing the array of chars may be significantly larger than the string it contains, which reduces the number of time-costly allocation calls. In addition, when the CString is set to be zero-sized, it still contains two wchar characters.
Naturally, when you compare the size of CString with the size of corresponding C-style array, the array will be smaller. However, if you want to manipulate your string as extensively as CString allows, you will eventually define your own variables for string size, buffer size and sometimes refcounter and/or guard flags. Indeed, you need to store your string size to avoid calling strlen each time you need it. You need to store separately your buffer size if you allow your buffer to be larger than the string length, and avoid calling reallocs each time you add to or subtract from the string. And so on -- you trade some small size increase for significant increases in speed, safety and functionality.
So, the answer depends on what you are going to do with the string. Suppose you want a string to store the name of your class for logging -- there a C-style string (const and static) will do fine. If you need a string to manipulate and use it extensively with MFC or ATL-related classes, use CString family types. If you need to manipulate string in the "engine" parts of your application that are isolated from its interface, and may be converted to other platforms, use std::string or write your own string type to suit your particular needs (this can be really useful when you write the "glue" code to place between the interface and the engine, otherwise std::string is preferable).
CString is from the MFC framework, which is specific to Windows. std::string is from the C++ standard. Both are library classes for managing strings in memory. std::string gives you code portability across platforms.
A raw array is always good for memory, but you still have to perform operations on strings, and that becomes difficult with a raw array: consider out-of-bounds checks, getting the string length, copying the array, changing the size because the string may grow, deleting the array, etc. For all these problems, string utility classes are good wrappers. The string class keeps the actual string on the heap, and you have the overhead of the string class itself. However, it provides the functionality to manage the string memory, which you would otherwise have to write by hand.
Prefer std::string if you can; if not, use CString.
In almost all cases I encourage novice programmers to use std::string or CString(*). First, they will make significantly fewer errors. I have seen many buffer overruns, memory invalidations, and memory leaks because of erroneous use of C arrays.
So which is more efficient, CString / std::string or raw character arrays? Memory-wise, generally speaking, all that CString and std::string have in addition is one integer for the size. The question is: does it matter?
So which is more efficient in terms of performance? Well, it depends on what you are doing with it and how you are using your C-arrays. But passing CString or std::string around can be computationally more efficient than C-arrays. The problem with C-arrays is that you can't be sure of who owns the memory and what type (heap/stack/literal) it is. Defensive programming results in more copies of arrays, you know, just to be sure that the memory you hold will be valid for the entire duration of when it is needed.
Why are std::string and CString more efficient than C-arrays when they are passed around by value? This is a bit more complicated, and the reasons differ. For CString it is simple: it is implemented as a COW (copy-on-write) object. So when you have 5 objects that originate from one CString, they will not use more memory than one until you start changing one of them. std::string has stricter requirements and thus is not allowed to share memory with other std::string objects. But if you have a newer compiler, std::string should implement move semantics, so returning a string from a function will only result in a copy of the pointer, not a reallocation.
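The move-semantics point can be sketched as follows. The function names are made up for illustration, and whether the heap pointer is literally reused is an implementation detail (short strings may be stored inline), so the sketch only relies on the contents arriving intact.

```cpp
#include <string>
#include <utility>

// Builds a string long enough that typical implementations store it on the heap.
std::string make_long_string() {
    std::string s(1000, 'x');
    return s;  // moved (or elided) on return, not deep-copied
}

// Takes over the contents of `src` via move; `src` is left valid but unspecified.
std::string take(std::string&& src) {
    return std::string(std::move(src));
}
```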
There are very few cases where raw C arrays are good and practical idea.
*) If you are already programming against MFC, why not just use CString.

Is there a way to pass ownership of an existing char* in heap to a std::string? [duplicate]

I have a situation where I need to process large (many GB's) amounts of data as such:
1. build a large string by appending many smaller (C char*) strings
2. trim the string
3. convert the string into a C++ const std::string for processing (read only)
The data in each iteration are independent.
My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.
Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?
Alternatively, could I use stringstreams or something similar to re-use a large buffer?
Edit: Thanks for the answers, for clarity, I think a revised question would be:
How can I build (via multiple appends) an STL C++ string efficiently? And if performing this action in a loop, where each iteration is totally independent, how can I re-use this allocated space?
You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.
A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hits the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That way you never have to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments you have and how large they are. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
UPDATE (since I still see occasional upvotes on this answer):
C++17 introduces another choice: std::string_view, a non-owning reference to character data, which has replaced std::string in many function signatures. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
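As a sketch of how that helps with the trimming step, the hypothetical trim function below returns a view into the caller's buffer rather than a new string. It assumes C++17 and treats only the space character as whitespace.

```cpp
#include <cstddef>
#include <string_view>

// Returns a view of `text` without leading/trailing spaces. No characters are copied.
std::string_view trim(std::string_view text) {
    const std::size_t first = text.find_first_not_of(' ');
    if (first == std::string_view::npos) return {};  // empty or all spaces
    const std::size_t last = text.find_last_not_of(' ');
    return text.substr(first, last - first + 1);
}
```

The returned view is only valid while the underlying buffer is alive and unmodified, so the buffer must outlive the processing step.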
Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.
See this link for more information on the reserve function.
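A minimal sketch of reusing one string across iterations; build is a made-up helper name. clear() resets the length, and implementations typically keep the capacity, although the standard does not strictly promise that.

```cpp
#include <string>
#include <vector>

// Joins the fragments into `out`, reusing whatever capacity `out` already has.
void build(std::string& out, const std::vector<const char*>& fragments) {
    out.clear();  // resets the length; the allocated buffer is typically kept
    for (const char* f : fragments) {
        out.append(f);
    }
}
```

The intended use is to declare the string once outside the loop, call reserve() with a generous estimate, and then call build on it in every iteration.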
To help with really big strings SGI has the class Rope in its STL.
Non-standard, but may be useful.
Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)
This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...
Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:
class lightweight_string { };
Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.
Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be able to override it.
In this case, might it be better to process the char* directly, instead of assigning it to a std::string?

Using read() directly into a C++ std::vector

I'm wrapping up user space linux socket functionality in some C++ for an embedded system (yes, this is probably reinventing the wheel again).
I want to offer a read and write implementation using a vector.
Doing the write is pretty easy, I can just pass &myvec[0] and avoid unnecessary copying. I'd like to do the same and read directly into a vector, rather than reading into a char buffer then copying all that into a newly created vector.
Now, I know how much data I want to read, and I can allocate appropriately (vec.reserve()). I can also read into &myvec[0], though this is probably a VERY BAD IDEA. Obviously doing this doesn't allow myvec.size to return anything sensible. Is there any way of doing this that:
Doesn't completely feel yucky from a safety/C++ perspective
Doesn't involve two copies of the data block - once from kernel to user space and once from a C char * style buffer into a C++ vector.
Use resize() instead of reserve(). This will set the vector's size correctly, and after that, &myvec[0] is, as usual, guaranteed to point to a contiguous block of memory.
Edit: Using &myvec[0] as a pointer to the underlying array for both reading and writing is safe and guaranteed to work by the C++ standard. Here's what Herb Sutter has to say:
So why do people continually ask whether the elements of a std::vector (or std::array) are stored contiguously? The most likely reason is that they want to know if they can cough up pointers to the internals to share the data, either to read or to write, with other code that deals in C arrays. That’s a valid use, and one important enough to guarantee in the standard.
I'll just add a short clarification, because the answer was already given. resize() with an argument greater than the current size will add elements to the collection and value-initialize them. If you create
std::vector<unsigned char> v;
and then resize it to a larger size, all the unsigned chars will be initialized to 0. By the way, you can do the same with a constructor:
std::vector<unsigned char> v(someSize);
So theoretically it may be a little bit slower than a raw array, but if the alternative is to copy the array anyway, it's better.
reserve() only prepares the memory, so that no reallocation is needed if new elements are added to the collection, but you can't access that memory.
You have to keep track of the number of elements written to your vector yourself. The vector won't know anything about it.
Assuming it's a POD struct, call resize rather than reserve. You can define an empty default constructor if you really don't want the data zeroed out before you fill the vector.
It's somewhat low level, but the semantics of construction of POD structs is purposely murky. If memmove is allowed to copy-construct them, I don't see why a socket-read shouldn't.
EDIT: ah, bytes, not a struct. Well, you can use the same trick, and define a struct with just a char and a default constructor which neglects to initialize it… if I'm guessing correctly that you care, and that's why you wanted to call reserve instead of resize in the first place.
If you want the vector to reflect the amount of data read, call resize() twice. Once before the read, to give yourself space to read into. Once again after the read, to set the size of the vector to the number of bytes actually read. reserve() is no good, since calling reserve doesn't give you permission to access the memory allocated for the capacity.
The first resize() will zero the elements of the vector, but this is unlikely to create much of a performance overhead. If it does then you could try Potatoswatter's suggestion, or you could give up on the size of the vector reflecting the size of the data read, and instead just resize() it once, then re-use it exactly as you would an allocated buffer in C.
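The resize-twice pattern could be sketched like this for a POSIX file descriptor (a socket, pipe, or file). read_into is a made-up helper name, and the error handling is kept deliberately minimal; real code would probably also want to handle EINTR and partial reads.

```cpp
#include <cstddef>
#include <vector>
#include <sys/types.h>
#include <unistd.h>  // POSIX read()

// Reads up to `max_bytes` from `fd` straight into `buf`.
// Returns false on error; on success buf.size() is the number of bytes read.
bool read_into(int fd, std::vector<char>& buf, std::size_t max_bytes) {
    buf.resize(max_bytes);  // first resize: make the space accessible
    const ssize_t n = ::read(fd, buf.data(), buf.size());
    if (n < 0) {
        buf.clear();
        return false;
    }
    buf.resize(static_cast<std::size_t>(n));  // second resize: shrink to what arrived
    return true;
}
```

Reusing the same vector across calls avoids repeated allocations, since shrinking with resize() does not release the capacity.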
Performance-wise, if you're reading from a socket in user mode, most likely you can easily handle data as fast as it comes in. Maybe not if you're connecting to another machine on a gigabit LAN, or if your machine is frequently running 100% CPU or 100% memory bandwidth. A bit of extra copying or memsetting is no big deal if you are eventually going to block on a read call anyway.
Like you, I'd want to avoid the extra copy in user-space, but not for performance reasons, just because if I don't do it, I don't have to write the code for it...

