File I/O for a vector of arrays - c++

This question has good answers on how to write a std::vector to a file: Reading and writing a std::vector into a file correctly
In my case, I have a vector of arrays:
vector<array<double, 3> > vec;
I would like to write into a file in order to get a file having the following format, where the values are doubles and the first number is the position in the vector and the second is the position in the array:
vec0_0 vec0_1 vec0_2 vec1_0 vec1_1 vec1_2 vec2_0 ...
Can I just use...
std::copy(vec.begin(), vec.end(), std::ostreambuf_iterator<char>(FILE));
...or...
size_t sz = vec.size();
FILE.write(reinterpret_cast<const char*>(&vec[0]), sz * sizeof(vec[0]));
...as proposed in the mentioned question for a scalar type, or do I need to do it differently because the type in the vector is an array?

From what I understand, std::array has contiguous storage. However, I don't think that guarantees there is no padding. If that were just a double[3], it would work out of the box, but I think you'd have to test very carefully and worry about portability with a std::array inside the container.
In fact, looking around there is already an example out there of a system that pads.
std::array alignment
sizeof(int) = 4;
sizeof( std::tr1::array< int,3 > ) = 16;
sizeof( std::tr1::array< int,4 > ) = 16;
sizeof( std::tr1::array< int,5 > ) = 32;
Presumably this padding is implementation defined, or maybe you can find it in the standard somewhere. In any case, I'd just iterate the thing or use a non-stl array.
I'd guess the concept is similar to a struct where there is often padding introduced to optimize memory access, however the compiler is optimizing that padding, and it can be turned off on most compilers with #pragma pack statements. Not true of stl containers to my knowledge.
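The "just iterate the thing" suggestion can be sketched as below. This is a minimal illustration rather than a definitive implementation: write_flat, to_flat_string and the TripleVec typedef are hypothetical names, and the output is text separated by single spaces, which is an assumption about the desired file format.

```cpp
#include <array>
#include <cassert>
#include <iomanip>
#include <limits>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::array<double, 3> > TripleVec;  // hypothetical alias

// Writes every element as text: vec0_0 vec0_1 vec0_2 vec1_0 ...
// Going element by element avoids writing raw container bytes, so any
// padding inside std::array never reaches the file.
void write_flat(std::ostream& out, const TripleVec& vec)
{
    // max_digits10 lets the text round-trip to the same double values.
    // Note: this changes the precision setting of the stream passed in.
    out << std::setprecision(std::numeric_limits<double>::max_digits10);
    bool first = true;
    for (const std::array<double, 3>& triple : vec) {
        for (double value : triple) {
            if (!first) out << ' ';
            out << value;
            first = false;
        }
    }
}

// Hypothetical convenience wrapper that returns the text instead of
// writing it to a file.
std::string to_flat_string(const TripleVec& vec)
{
    std::ostringstream buffer;
    write_flat(buffer, vec);
    assert(buffer.good());  // writing to an in-memory stream should not fail
    return buffer.str();
}
```

With a std::ofstream in place of the string stream, the same write_flat call produces the file. Reading it back is a loop of operator>> calls, three doubles at a time.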

Related

How do I serialise/deserialise a std::vector<bool> most efficiently?

I'm trying to write the contents of a std::vector<bool> to disk into a binary file. As the write() method of many of the STL output streams takes in a pointer to the array itself, as well as the number of bytes to write, for a 'normal' vector I'd end up doing something like this:
std::vector<unsigned int> dataVector = {0, 1, 2, 3, 4};
std::fstream outStream = std::fstream("vectordump.bin", std::ios::out | std::ios::binary);
outStream.write((char*) dataVector.data(), dataVector.size() * sizeof(unsigned int));
outStream.close();
However, the std::vector<bool> is a special case, as the STL implementation is allowed to pack the bools into single bits. The above approach will therefore technically not consistently work, because it's unspecified how the data is precisely laid out in memory.
Is there any way of serialising/deserialising my bool vector without having to pack/unpack the data?
I think you're better off to just translate that vector into std::vector<std::byte>/std::vector<unsigned char>.
std::vector<bool> isn't even required to have contiguous memory, and the specialization has no data() member, so the write() approach shown above doesn't carry over to it.
No, there isn't.
Sorry.
A good reason to avoid this container!
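The translation to std::vector<unsigned char> suggested above might look like the following sketch. The function names to_bytes and from_bytes are hypothetical, and the scheme spends one byte per bool, trading space for a well-defined layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies each bool into one unsigned char (0 or 1). The result has
// contiguous storage, so its data() can be passed to ostream::write.
std::vector<unsigned char> to_bytes(const std::vector<bool>& bits)
{
    return std::vector<unsigned char>(bits.begin(), bits.end());
}

// Rebuilds the bool vector from bytes read back from disk.
std::vector<bool> from_bytes(const std::vector<unsigned char>& bytes)
{
    std::vector<bool> bits(bytes.size());
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        assert(bytes[i] <= 1);  // other values were not produced by to_bytes
        bits[i] = (bytes[i] != 0);
    }
    return bits;
}
```

The byte vector can then be written with the same write() call used for the unsigned int example in the question, with sizeof(unsigned char) as the element size.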

Copying an array into a std::vector

I was searching about this topic and I found many ways to convert an array[] to an std::vector, like using:
assign(a, a + n)
or, direct in the constructor:
std::vector<unsigned char> v ( a, a + n );
Those solve my problem, but I am wondering if it is possible (and correct) to do:
myvet.resize( 10 );
memcpy( &myvet[0], buffer, 10 );
I am wondering this because I have the following code:
IDiskAccess::ERetRead nsDisks::DiskAccess::Read( std::vector< uint8_t >& bufferRead, int32_t totalToRead )
{
uint8_t* data = new uint8_t[totalToRead];
DWORD totalRead;
ReadFile( mhFile, data, totalToRead, &totalRead, NULL );
bufferRead.resize( totalRead );
bufferRead.assign( data, data + totalRead );
delete[] data;
return IDiskAccess::READ_OK;
}
And I would like to do:
IDiskAccess::ERetRead nsDisks::DiskAccess::Read( std::vector< uint8_t >& bufferRead, int32_t totalToRead )
{
bufferRead.resize( totalToRead );
DWORD totalRead;
ReadFile( mhFile, &bufferRead[0], totalToRead, &totalRead, NULL );
bufferRead.resize( totalRead );
return IDiskAccess::READ_OK;
}
(I have removed the error treatment of the ReadFile function to simplify the post).
It is working, but I am afraid that it is not safe. I believe it is OK, as the memory used by the vector is contiguous, but I've never seen anyone use vectors this way.
Is it correct to use vectors like this? Is there any other better option?
Yes, it is safe with std::vector. The C++ standard guarantees that the elements are stored at contiguous memory locations.
C++11 Standard:
23.3.6.1 Class template vector overview [vector.overview]
A vector is a sequence container that supports random access iterators. In addition, it supports (amortized) constant time insert and erase operations at the end; insert and erase in the middle take linear time. Storage management is handled automatically, though hints can be given to improve efficiency. The elements of a vector are stored contiguously, meaning that if v is a vector where T is some type other than bool, then it obeys the identity &v[n] == &v[0] + n for all 0 <= n < v.size().
Yes, it is fine to do that. You might want to do myvet.data() instead of &myvet[0] if it looks better to you, but they both have the same effect. Also, if circumstances permit, you can use std::copy instead and have more type-safety and all those other C++ standard library goodies.
The storage that a vector uses is guaranteed to be contiguous, which makes it suitable for use as a buffer or with other functions.
Make sure that you don't modify the vector (such as calling push_back on it, etc) while you are using the pointer you get from data or &v[0] because the vector could resize its buffer on one of those operations and invalidate the pointer.
That approach is correct; it only depends on the vector having contiguous memory, which is required by the standard. I believe that in C++11 there is a new data() member function in vectors that returns a pointer to the buffer. Also note that in the case of memcpy you need to pass the size in bytes, not the number of elements in the array.
The memory in vector is guaranteed to be allocated contiguously, and unsigned char is POD, therefore it is totally safe to memcpy into it (assuming you don't copy more than you have allocated, of course).
Do your resize first, and it should work fine.
vector<int> v;
v.resize(100);
memcpy(&v[0], someArrayOfSize100, 100 * sizeof(int));
Yes, the solution using memcpy is correct; the buffer held by a vector is contiguous. But it's not quite type-safe, so prefer assign or std::copy.
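The resize, read, then shrink pattern from the question can be sketched portably as follows. This is only an illustration under assumptions: std::istream::read stands in for the Windows ReadFile call, and read_up_to and read_from_string are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads up to max_bytes from `in` straight into the vector's storage,
// then shrinks the vector to the number of bytes actually read.
std::vector<char> read_up_to(std::istream& in, std::size_t max_bytes)
{
    std::vector<char> buffer(max_bytes);
    if (max_bytes == 0) return buffer;  // nothing to read into
    in.read(buffer.data(), static_cast<std::streamsize>(max_bytes));
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    assert(got <= max_bytes);
    buffer.resize(got);                 // drop the unused tail
    return buffer;
}

// Hypothetical helper for trying the function on in-memory data.
std::vector<char> read_from_string(const std::string& source, std::size_t max_bytes)
{
    std::istringstream in(source);
    return read_up_to(in, max_bytes);
}
```

The pointer from data() is used only while the vector is not modified, which is the condition the answers above warn about.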

What's the proper way to declare and initialize a (large) two dimensional object array in c++?

I need to create a large two dimensional array of objects. I've read some related questions on this site and others regarding multi_array, matrix, vector, etc, but haven't been able to put it together. If you recommend using one of those, please go ahead and translate the code below.
Some considerations:
The array is somewhat large (1300 x 1372).
I might be working with more than one of these at a time.
I'll have to pass it to a function at some point.
Speed is a large factor.
The two approaches that I thought of were:
Pixel pixelArray[1300][1372];
for(int i=0; i<1300; i++) {
for(int j=0; j<1372; j++) {
pixelArray[i][j].setOn(true);
...
}
}
and
Pixel* pixelArray[1300][1372];
for(int i=0; i<1300; i++) {
for(int j=0; j<1372; j++) {
pixelArray[i][j] = new Pixel();
pixelArray[i][j]->setOn(true);
...
}
}
What's the right approach/syntax here?
Edit:
Several answers have assumed Pixel is small - I left out details about Pixel for convenience, but it's not small/trivial. It has ~20 data members and ~16 member functions.
Your first approach allocates everything on the stack, which is otherwise fine, but leads to stack overflow when you try to allocate too much. The default limit is typically a few megabytes (for example 8 MB on many Linux systems and 1 MB on Windows), so allocating an array of 1300 * 1372 elements on the stack is not an option.
Your second approach performs 1300 * 1372 separate heap allocations, which is a tremendous load for the allocator, which holds multiple linked lists to chunks of allocated and free memory. Also a bad idea, especially since Pixel seems to be rather small.
What I would do is this:
Pixel* pixelArray = new Pixel[1300 * 1372];
for(int i=0; i<1300; i++) {
for(int j=0; j<1372; j++) {
pixelArray[i * 1372 + j].setOn(true);
...
}
}
This way you allocate one large chunk of memory on heap. Stack is happy and so is the heap allocator.
If you want to pass it to a function, I'd vote against using simple arrays. Consider:
void doWork(Pixel array[][]);
This does not contain any size information. You could pass the size info via separate arguments, but I'd rather use something like std::vector<Pixel>. Of course, this requires that you define an addressing convention (row-major or column-major).
An alternative is std::vector<std::vector<Pixel> >, where each level of vectors is one array dimension. Advantage: the double subscript, as in pixelArray[x][y], works. But the creation of such a structure is tedious, copying is more expensive because it happens per contained vector instance instead of with a simple memcpy, and the vectors contained in the top-level vector need not have the same size.
These are basically your options using the Standard Library. The right solution would be something like std::vector with two dimensions. Numerical libraries and image manipulation libraries come to mind, but matrix and image classes are most likely limited to primitive data types in their elements.
EDIT: Forgot to make it clear that everything above is only arguments. In the end, your personal taste and the context will have to be taken into account. If you're on your own in the project, vector plus defined and documented addressing convention should be good enough. But if you're in a team, and it's likely that someone will disregard the documented convention, the cascaded vector-in-vector structure is probably better because the tedious parts can be implemented by helper functions.
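A sketch of the "vector plus documented addressing convention" option, showing how the size information travels with the data when it is passed to a function. Everything here is illustrative: the Pixel struct is a one-member stand-in for the question's class, and PixelGrid, turn_all_on and count_on are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's Pixel class.
struct Pixel {
    bool on = false;
    void setOn(bool value) { on = value; }
};

// One contiguous heap block plus its dimensions; row-major addressing.
class PixelGrid {
public:
    PixelGrid(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    Pixel& at(std::size_t row, std::size_t col) {
        assert(row < rows_ && col < cols_);
        return data_[row * cols_ + col];
    }
    const Pixel& at(std::size_t row, std::size_t col) const {
        assert(row < rows_ && col < cols_);
        return data_[row * cols_ + col];
    }
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<Pixel> data_;
};

// The grid carries its own size, so no extra size arguments are needed.
void turn_all_on(PixelGrid& grid)
{
    for (std::size_t r = 0; r < grid.rows(); ++r)
        for (std::size_t c = 0; c < grid.cols(); ++c)
            grid.at(r, c).setOn(true);
}

std::size_t count_on(const PixelGrid& grid)
{
    std::size_t total = 0;
    for (std::size_t r = 0; r < grid.rows(); ++r)
        for (std::size_t c = 0; c < grid.cols(); ++c)
            if (grid.at(r, c).on) ++total;
    return total;
}
```

Passing by reference avoids copying the block; PixelGrid grid(1300, 1372) followed by turn_all_on(grid) corresponds to the loops in the question.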
I'm not sure how complicated your Pixel data type is, but maybe something like this will work for you?:
std::fill(array, array+100, 42); // sets every value in the array to 42
Reference:
Initialization of a normal array with one default value
Check out Boost's Generic Image Library.
gray8_image_t pixelArray;
pixelArray.recreate(1300,1372);
for(gray8_image_t::iterator pIt = pixelArray.begin(); pIt != pixelArray.end(); pIt++) {
*pIt = 1;
}
My personal preference would be to use std::vector:
typedef std::vector<Pixel> PixelRow;
typedef std::vector<PixelRow> PixelMatrix;
PixelMatrix pixelArray(1300, PixelRow(1372, Pixel(true)));
// ^^^^ ^^^^ ^^^^^^^^^^^
// Size 1 Size 2 default Value
While I wouldn't necessarily make this a struct, this demonstrates how I would approach storing and accessing the data. If Pixel is rather large, you may want to use a std::deque instead.
struct Pixel2D {
Pixel2D (size_t rsz_, size_t csz_) : data(rsz_*csz_), rsz(rsz_), csz(csz_) {
for (size_t r = 0; r < rsz; r++)
for (size_t c = 0; c < csz; c++)
at(r, c).setOn(true);
}
Pixel &at(size_t row, size_t col) {return data.at(row*csz+col);}
std::vector<Pixel> data;
size_t rsz;
size_t csz;
};

Nested STL vector using way too much memory

I have an STL vector My_Partition_Vector of Partition objects, defined as
struct Partition // the event log data structure
{
int key;
std::vector<std::vector<char> > partitions;
float modularity;
};
The actual nested structure of Partition.partitions varies from object to object, but the total number of chars stored in Partition.partitions is always 16.
I assumed therefore that the total size of the object should be more or less 24 bytes (16 + 4 + 4). However for every 100,000 items I add to My_Partition_Vector, memory consumption (found using ps -aux) increases by around 20 MB indicating around 209 bytes for each Partition Object.
This is a nearly 9 Fold increase!? Where is all this extra memory usage coming from? Some kind of padding in the STL vector, or the struct? How can I resolve this (and stop it reaching into swap)?
For one thing std::vector models a dynamic array so if you know that you'll always have 16 chars in partitions using std::vector is overkill. Use a good old C style array/matrix, boost::array or boost::multi_array.
To reduce the number of re-allocations needed when inserting/adding elements, given its memory layout constraints, std::vector is allowed to preallocate memory for a certain number of elements up front (and its capacity() member function will tell you how much).
While I think he may be overstating the situation just a tad, I'm in general agreement with DeadMG's conclusion that what you're doing is asking for trouble.
Although I'm generally the one looking at (whatever mess somebody has made) and saying "don't do that, just use a vector", this case might well be an exception. You're creating a huge number of objects that should be tiny. Unfortunately, a vector typically looks something like this:
template <class T>
class vector {
T *data;
size_t allocated;
size_t valid;
public:
// ...
};
On a typical 32-bit machine, that's twelve bytes already. Since you're using a vector<vector<char> >, you're going to have 12 bytes for the outer vector, plus twelve more for each vector it holds. Then, when you actually store any data in your vectors, each of those needs to allocate a block of memory from the free store. Depending on how your free store is implemented, you'll typically have a minimum block size -- frequently 32 or even 64 bytes. Worse, the heap typically has some overhead of its own, so it'll add some more memory onto each block, for its own book-keeping (e.g., it might use a linked list of blocks, adding another pointer worth of data to each allocation).
Just for grins, let's assume you average four vectors of four bytes apiece, and that your heap manager has a 32-byte minimum block size and one extra pointer (or int) for its bookkeeping (giving a real minimum of 36 bytes per block). Multiplying that out, I get 204 bytes apiece -- close enough to your 209 to believe that's reasonably close to what you're dealing with.
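One way to reproduce the 204-byte figure is sketched below. The breakdown is an assumption about how the terms were combined: it counts the outer vector object, the inner vector objects, and one heap block per inner vector, and leaves out the key and modularity members. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Cost of one heap allocation: the payload is rounded up to the
// minimum block size, then the allocator's bookkeeping is added.
std::size_t block_cost(std::size_t payload, std::size_t min_block,
                       std::size_t bookkeeping)
{
    assert(min_block > 0);
    return std::max(payload, min_block) + bookkeeping;
}

// All parameters are the answer's assumptions, not measurements.
std::size_t estimate_bytes(std::size_t vector_object, std::size_t inner_vectors,
                           std::size_t payload, std::size_t min_block,
                           std::size_t bookkeeping)
{
    return vector_object                                              // outer vector object
         + inner_vectors * vector_object                              // inner vector objects
         + inner_vectors * block_cost(payload, min_block, bookkeeping);  // their heap blocks
}
```

With 12-byte vector objects, four inner vectors of four bytes each, a 32-byte minimum block and 4 bytes of bookkeeping, this gives 12 + 4*12 + 4*36 = 204.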
The question at that point is how to deal with the problem. One possibility is to try to work behind the scenes. All the containers in the standard library use allocators to get their memory. While the default allocator gets memory directly from the free store, you can substitute a different one if you choose. If you do some looking around, you can find any number of alternative allocators, many/most of which are to help with exactly the situation you're in -- reducing wasted memory when allocating lots of small objects. A couple to look at would be the Boost Pool Allocator and the Loki small object allocator.
Another possibility (that can be combined with the first) would be to quit using a vector<vector<char> > at all, and replace it with something like:
char partitions[16];
struct parts {
int part0 : 4;
int part1 : 4;
int part2 : 4;
int part3 : 4;
int part4 : 4;
int part5 : 4;
int part6 : 4;
int part7 : 4;
};
For the moment, I'm assuming a maximum of 8 partitions -- if it could be 16, you can add more to parts. This should probably reduce memory usage quite a bit more, but (as-is) will affect your other code. You could also wrap this up into a small class of its own that provides 2D-style addressing to minimize impact on the rest of your code.
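Because the layout of bit-fields is implementation-defined, the same 4-bits-per-entry packing can also be written with explicit shifts and wrapped in a small class. The following is only a sketch of that packing idea: Nibbles8 is a hypothetical name, and what each 4-bit entry means is left to the surrounding code.

```cpp
#include <cassert>
#include <cstdint>

// Eight 4-bit entries (values 0..15) packed into one 32-bit word.
class Nibbles8 {
public:
    Nibbles8() : bits_(0) {}

    unsigned get(unsigned index) const {
        assert(index < 8);
        return (bits_ >> (index * 4)) & 0xFu;
    }

    void set(unsigned index, unsigned value) {
        assert(index < 8 && value < 16);
        const unsigned shift = index * 4;
        const std::uint32_t mask = std::uint32_t(0xF) << shift;
        // Clear the old entry, then store the new one.
        bits_ = (bits_ & ~mask) | (std::uint32_t(value) << shift);
    }

private:
    std::uint32_t bits_;
};
```

An accessor layer like this keeps the rest of the code independent of how the entries are stored, which is the point of the wrapper class mentioned above.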
If you store a near constant amount of objects, then I suggest to use a 2-dimensional array.
The most likely reason for the memory consumption is debug data. STL implementations usually store A LOT of debug data. Never profile an application with debug flags on.
...This is a bit of a side conversation, but boost::multi_array was suggested as an alternative to the OP's use of nested vectors. My finding was that multi_array was using a similar amount of memory when applied to the OP's operating parameters.
I derived this code from the example at Boost.MultiArray. On my machine, this showed multi_array using about 10x more memory than ideally required assuming that the 16 bytes are arranged in a simple rectangular geometry.
To evaluate the memory usage, I checked the system monitor while the program was running and I compiled with
( export CXXFLAGS="-Wall -DNDEBUG -O3" ; make main && ./main )
Here's the code...
#include <iostream>
#include <vector>
#include "boost/multi_array.hpp"
#include <tr1/array>
#include <cassert>
#define USE_CUSTOM_ARRAY 0 // compare memory usage of my custom array vs. boost::multi_array
using std::cerr;
using std::vector;
#if USE_CUSTOM_ARRAY
template< typename T, int YSIZE, int XSIZE >
class array_2D
{
std::tr1::array<T,YSIZE*XSIZE> data;
public:
T & operator () ( int y, int x ) { return data[y*XSIZE+x]; } // preferred accessor (avoid pointers)
T * operator [] ( int index ) { return &data[index*XSIZE]; } // alternative accessor (mimics boost::multi_array syntax)
};
#endif
int main ()
{
int COUNT = 1024*1024;
#if USE_CUSTOM_ARRAY
vector< array_2D<char,4,4> > A( COUNT );
typedef int index;
#else
typedef boost::multi_array<char,2> array_type;
typedef array_type::index index;
vector<array_type> A( COUNT, array_type(boost::extents[4][4]) );
#endif
// Assign values to the elements
int values = 0;
for ( int n=0; n<COUNT; n++ )
for(index i = 0; i != 4; ++i)
for(index j = 0; j != 4; ++j)
A[n][i][j] = values++;
// Verify values
int verify = 0;
for ( int n=0; n<COUNT; n++ )
for(index i = 0; i != 4; ++i)
for(index j = 0; j != 4; ++j)
{
assert( A[n][i][j] == (char)((verify++)&0xFF) );
#if USE_CUSTOM_ARRAY
assert( A[n][i][j] == A[n](i,j) ); // testing accessors
#endif
}
cerr <<"spinning...\n";
while ( 1 ) {} // wait here (so you can check memory usage in the system monitor)
return 0;
}
On my system, sizeof(vector) is 24. This probably corresponds to 3 8-byte members: capacity, size, and pointer. Additionally, you need to consider the actual allocations which would be between 1 and 16 bytes (plus allocation overhead) for the inner vector and between 24 and 384 bytes for the outer vector ( sizeof(vector) * partitions.capacity() ).
I wrote a program to sum this up...
for ( int Y=1; Y<=16; Y++ )
{
const int X = 16/Y;
if ( X*Y != 16 ) continue; // ignore imperfect geometries
Partition a;
a.partitions = vector< vector<char> >( Y, vector<char>(X) );
int sum = sizeof(a); // main structure
sum += sizeof(vector<char>) * a.partitions.capacity(); // outer vector
for ( int i=0; i<(int)a.partitions.size(); i++ )
sum += sizeof(char) * a.partitions[i].capacity(); // inner vector
cerr <<"X="<<X<<", Y="<<Y<<", size = "<<sum<<"\n";
}
The results show how much memory (not including allocation overhead) is needed for each simple geometry...
X=16, Y=1, size = 80
X=8, Y=2, size = 104
X=4, Y=4, size = 152
X=2, Y=8, size = 248
X=1, Y=16, size = 440
Look at how the "sum" is calculated to see what all of the components are.
The results posted are based on my 64-bit architecture. If you have a 32-bit architecture the sizes would be almost half as much -- but still a lot more than what you had expected.
In conclusion, std::vector<> is not very space efficient for doing a whole bunch of very small allocations. If your application is required to be efficient, then you should use a different container.
My approach to solving this would probably be to allocate the 16 chars with
std::tr1::array<char,16>
and wrap that with a custom class that maps 2D coordinates onto the array allocation.
Below is a very crude way of doing this, just as an example to get you started. You would have to change this to meet your specific needs -- especially the ability to specify the geometry dynamically.
template< typename T, int YSIZE, int XSIZE >
class array_2D
{
std::tr1::array<T,YSIZE*XSIZE> data;
public:
T & operator () ( int y, int x ) { return data[y*XSIZE+x]; } // preferred accessor (avoid pointers)
T * operator [] ( int index ) { return &data[index*XSIZE]; } // alternative accessor (mimics boost::multi_array syntax)
};
16 bytes is a complete and total waste. You're storing a hell of a lot of data about very small objects. A vector of vector is the wrong solution to use. You should log sizeof(vector) - it's not insignificant, as it performs a substantial function. On my compiler, sizeof(vector) is 20. So each Partition is 4 + 4 + 16 + 20 + 20*number of inner partitions + memory overheads like the vectors not being the perfect size.
You're only storing 16 bytes of data, and wasting ridiculous amounts of memory allocating them in the most segregated, highest overhead way you could possibly think of. The vector doesn't use a lot of memory - you have a terrible design.

How do you copy the contents of an array to a std::vector in C++ without looping?

I have an array of values that is passed to my function from a different part of the program that I need to store for later processing. Since I don't know how many times my function will be called before it is time to process the data, I need a dynamic storage structure, so I chose a std::vector. I don't want to have to do the standard loop to push_back all the values individually, it would be nice if I could just copy it all using something similar to memcpy.
There have been many answers here and just about all of them will get the job done.
However there is some misleading advice!
Here are the options:
vector<int> dataVec;
int dataArray[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
unsigned dataArraySize = sizeof(dataArray) / sizeof(int);
// Method 1: Copy the array to the vector using back_inserter.
{
copy(&dataArray[0], &dataArray[dataArraySize], back_inserter(dataVec));
}
// Method 2: Same as 1 but pre-extend the vector by the size of the array using reserve
{
dataVec.reserve(dataVec.size() + dataArraySize);
copy(&dataArray[0], &dataArray[dataArraySize], back_inserter(dataVec));
}
// Method 3: Memcpy
{
dataVec.resize(dataVec.size() + dataArraySize);
memcpy(&dataVec[dataVec.size() - dataArraySize], &dataArray[0], dataArraySize * sizeof(int));
}
// Method 4: vector::insert
{
dataVec.insert(dataVec.end(), &dataArray[0], &dataArray[dataArraySize]);
}
// Method 5: vector + vector
{
vector<int> dataVec2(&dataArray[0], &dataArray[dataArraySize]);
dataVec.insert(dataVec.end(), dataVec2.begin(), dataVec2.end());
}
To cut a long story short Method 4, using vector::insert, is the best for bsruth's scenario.
Here are some gory details:
Method 1 is probably the easiest to understand. Just copy each element from the array and push it into the back of the vector. Alas, it's slow. Because there's a loop (implied with the copy function), each element must be treated individually; no performance improvements can be made based on the fact that we know the array and vectors are contiguous blocks.
Method 2 is a suggested performance improvement to Method 1; just pre-reserve the size of the array before adding it. For large arrays this might help. However the best advice here is never to use reserve unless profiling suggests you may be able to get an improvement (or you need to ensure your iterators are not going to be invalidated). Bjarne agrees. Incidentally, I found that this method performed the slowest most of the time though I'm struggling to comprehensively explain why it was regularly significantly slower than method 1...
Method 3 is the old school solution - throw some C at the problem! Works fine and fast for POD types. In this case resize is required to be called since memcpy works outside the bounds of vector and there is no way to tell a vector that its size has changed. Apart from being an ugly solution (byte copying!) remember that this can only be used for POD types. I would never use this solution.
Method 4 is the best way to go. Its meaning is clear, it's (usually) the fastest and it works for any objects. There is no downside to using this method for this application.
Method 5 is a tweak on Method 4 - copy the array into a vector and then append it. Good option - generally fast-ish and clear.
Finally, you are aware that you can use vectors in place of arrays, right? Even when a function expects c-style arrays you can use vectors:
vector<char> v(50); // Ensure there's enough space
strcpy(&v[0], "prefer vectors to c arrays");
If you can construct the vector after you've gotten the array and array size, you can just say:
std::vector<ValueType> vec(a, a + n);
...assuming a is your array and n is the number of elements it contains. Otherwise, std::copy() w/resize() will do the trick.
I'd stay away from memcpy() unless you can be sure that the values are plain-old data (POD) types.
Also, worth noting that none of these really avoids the for loop--it's just a question of whether you have to see it in your code or not. O(n) runtime performance is unavoidable for copying the values.
Finally, note that C-style arrays are perfectly valid containers for most STL algorithms--the raw pointer is equivalent to begin(), and (ptr + n) is equivalent to end().
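As a small illustration of the last point, a pointer pair can be handed directly to standard algorithms. The function name sort_and_sum is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>

// Sorts a C-style array in place and returns the sum of its elements.
// `values` plays the role of begin() and `values + count` of end().
long sort_and_sum(int* values, std::size_t count)
{
    assert(values != nullptr || count == 0);
    std::sort(values, values + count);
    return std::accumulate(values, values + count, 0L);
}
```

The same two pointers work as the iterator arguments to vector::insert, vector::assign and the range constructor shown in the answers above.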
If all you are doing is replacing the existing data, then you can do this
std::vector<int> data; // evil global :)
void CopyData(int *newData, size_t count)
{
data.assign(newData, newData + count);
}
std::copy is what you're looking for.
Since I can only edit my own answer, I'm going to make a composite answer from the other answers to my question. Thanks to all of you who answered.
Using std::copy, this still iterates in the background, but you don't have to type out the code.
int foo(int* data, int size)
{
static std::vector<int> my_data; //normally a class variable
std::copy(data, data + size, std::back_inserter(my_data));
return 0;
}
Using regular memcpy. This is probably best used for basic data types (i.e. int) but not for more complex arrays of structs or classes.
vector<int> x(size);
memcpy(&x[0], source, size*sizeof(int));
int dataArray[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };//source
unsigned dataArraySize = sizeof(dataArray) / sizeof(int);
std::vector<int> myvector (dataArraySize );//target
std::copy ( dataArray, dataArray+dataArraySize , myvector.begin() );
//myvector now has 1,2,3,...10 :-)
Yet another answer, since the person said "I don't know how many times my function will be called", you could use the vector insert method like so to append arrays of values to the end of the vector:
vector<int> x;
void AddValues(int* values, size_t size)
{
x.insert(x.end(), values, values+size);
}
I like this way because the implementation of the vector should be able to optimize for the best way to insert the values based on the iterator type and the type itself. You are somewhat relying on the implementation of the STL.
If you need to guarantee the fastest speed and you know your type is a POD type then I would recommend the resize method in Thomas's answer:
vector<int> x;
void AddValues(int* values, size_t size)
{
size_t old_size(x.size());
x.resize(old_size + size, 0);
memcpy(&x[old_size], values, size * sizeof(int));
}
avoid the memcpy, I say. No reason to mess with pointer operations unless you really have to. Also, it will only work for POD types (like int) but would fail if you're dealing with types that require construction.
In addition to the methods presented above, you need to make sure you use std::vector::resize(), or construct the vector to size, so that your vector has enough elements in it to hold your data. If not, you will corrupt memory. This is true of either std::copy() or memcpy(). Note that std::vector::reserve() only allocates capacity and does not create elements, so it is sufficient only when you append through push_back() or a back_inserter.
This is the reason to use vector.push_back(), you can't write past the end of the vector.
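The difference between sizing the vector first and appending can be sketched as follows; copy_into and append_to are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Overwrites dest: resize first so the destination elements exist,
// then copy over them.
void copy_into(std::vector<int>& dest, const int* source, std::size_t count)
{
    dest.resize(count);
    std::copy(source, source + count, dest.begin());
}

// Appends to dest: reserve only sets capacity; back_inserter calls
// push_back, which is what actually creates the new elements.
void append_to(std::vector<int>& dest, const int* source, std::size_t count)
{
    dest.reserve(dest.size() + count);
    std::copy(source, source + count, std::back_inserter(dest));
    assert(dest.capacity() >= dest.size());
}
```

Copying to dest.begin() after only a reserve() would write to elements that do not exist, which is the memory corruption described above.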
Assuming you know how big the items in the vector are:
std::vector<int> myArray;
myArray.resize (item_count, 0);
memcpy (&myArray.front(), source, item_count * sizeof(int));
http://www.cppreference.com/wiki/stl/vector/start