c++ data structure for storing millions of int16 - c++

Good afternoon.
I have the following situation: there are three sets of data, each a two-dimensional table with about 50 million fields (~6000 rows and ~8000 columns).
The data is stored in binary files.
Language: C++
I only need to display this data.
But I got stuck when I tried to read it (I used std::vector, but the waiting time is too long).
What is the best way to read and store this amount of data (std::vector, plain pointers, special libraries)?
Links to articles, books, or just personal experience are welcome.

Well, if you don't need all this data at once, you can memory-map the file and read the data as if it were one giant array. Generally the operating system / file system cache works well enough for most applications, but certainly YMMV.
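A minimal sketch of the memory-mapping idea, assuming a POSIX system (mmap()) and a file that holds rows * cols int16 values row by row in the host's byte order. The class name MappedTable and its at() accessor are made up for illustration; on Windows the equivalent calls would be different.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view of a rows x cols table of int16 values stored row by row.
// Hypothetical helper, not part of any library.
class MappedTable {
public:
    MappedTable(const std::string& path, std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), bytes_(rows * cols * sizeof(std::int16_t)) {
        const int fd = ::open(path.c_str(), O_RDONLY);
        if (fd == -1) throw std::runtime_error("open() failed: " + path);
        struct stat sb;
        if (::fstat(fd, &sb) == -1 ||
            static_cast<std::size_t>(sb.st_size) < bytes_) {
            ::close(fd);
            throw std::runtime_error("file is smaller than rows * cols values");
        }
        void* p = ::mmap(nullptr, bytes_, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping stays valid after the descriptor is closed
        if (p == MAP_FAILED) throw std::runtime_error("mmap() failed");
        data_ = static_cast<const std::int16_t*>(p);
    }
    ~MappedTable() { ::munmap(const_cast<std::int16_t*>(data_), bytes_); }
    MappedTable(const MappedTable&) = delete;
    MappedTable& operator=(const MappedTable&) = delete;

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    // Pages are loaded from disk by the OS only when they are first touched.
    std::int16_t at(std::size_t row, std::size_t col) const {
        return data_[row * cols_ + col];
    }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::size_t bytes_;
    const std::int16_t* data_ = nullptr;
};
```

Constructing the object costs almost nothing regardless of the table size; the actual disk reads happen lazily as at() touches new pages, which suits a viewer that only shows part of the table at a time.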

There's no reason you shouldn't use plain old read and write on ifstream/ofstream. The following code doesn't take very long for a BigArray b( 6000, 8000 );
#include <fstream>
#include <iostream>
#include <string>
#include <stdlib.h>
class BigArray {
public:
BigArray( int r, int c ) : rows(r), cols(c){
data = (int*)malloc(rows*cols*sizeof(int));
if( NULL == data ){
std::cout << "ERROR\n";
}
}
virtual ~BigArray(){ free( data ); }
void fill( int n ){
int v = 0;
int * intptr = data;
for( int irow = 0; irow < rows; irow++ ){
for( int icol = 0; icol < cols; icol++ ){
*intptr++ = v++;
v %= n;
}
}
}
void readFromFile( std::string path ){
std::ifstream inf( path.c_str(), std::ifstream::binary );
inf.read( (char*)data, rows*cols*sizeof(*data) );
inf.close();
}
void writeToFile( std::string path ){
std::ofstream outf( path.c_str(), std::ofstream::binary );
outf.write( (char*)data, rows*cols*sizeof(*data) );
outf.close();
}
private:
int rows;
int cols;
int* data;
};
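Since the question is about int16 values and std::vector, here is a hedged variant of the same idea: a std::vector<std::int16_t> filled by a single read() call. The function name readTable is an assumption for illustration, and the file is assumed to contain exactly rows * cols values in host byte order with no header.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Load a rows x cols table of int16 values with one bulk read.
// The vector is sized once up front, so there is no per-element push_back.
std::vector<std::int16_t> readTable(const std::string& path,
                                    std::size_t rows, std::size_t cols) {
    std::vector<std::int16_t> table(rows * cols);
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    in.read(reinterpret_cast<char*>(table.data()),
            static_cast<std::streamsize>(table.size() * sizeof(std::int16_t)));
    if (!in) throw std::runtime_error("file is shorter than rows * cols values");
    return table;  // element (r, c) lives at table[r * cols + c]
}
```

A std::vector is usually only slow here when it is grown one element at a time or filled through formatted extraction; sized once and filled in bulk, it should behave much like the malloc-based array above.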


(C++) Fastest way possible for reading in matrix files (arbitrary size)

I'm developing a bioinformatic tool, which requires reading in millions of matrix files (average dimension = (20k, 20k)). They are tab-delimited text files, and they look something like:
0.53 0.11
0.24 0.33
Because the software reads the matrix files one at a time, memory is not an issue, but it's very slow. The following is my current function for reading in a matrix file. I first make a matrix object using a double pointer, then fill in the matrix by looping through an input file.
float** make_matrix(int nrow, int ncol, float val){
float** M = new float *[nrow];
for(int i = 0; i < nrow; i++) {
M[i] = new float[ncol];
for(int j = 0; j < ncol; j++) {
M[i][j] = val;
}
}
return M;
}
float** read_matrix(string fname, int dim_1, int dim_2){
float** K = make_matrix(dim_1, dim_2, 0);
ifstream ifile(fname);
for (int i = 0; i < dim_1; ++i) {
for (int j = 0; j < dim_2; ++j) {
ifile >> K[i][j];
}
}
ifile.clear();
ifile.seekg(0, ios::beg);
return K;
}
Is there a much faster way to do this? From my experience with python, reading in a matrix file using pandas is so much faster than using python for-loops. Is there a trick like that in c++?
(added)
Thanks so much everyone for all your suggestions and comments!
The fastest way, by far, is to change the way you write those files: write them in binary format, with two ints first (width, height), then just dump your values.
You will be able to load it in just three read calls.
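A rough sketch of that layout, under a few assumptions of mine: the two leading integers are stored as fixed-width std::uint32_t, the values are float, and the file is read on the same kind of machine that wrote it (same byte order). The names FlatMatrix, saveBinary and loadBinary are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

struct FlatMatrix {
    std::uint32_t width = 0;
    std::uint32_t height = 0;
    std::vector<float> values;  // row-major, width * height entries
};

void saveBinary(const std::string& path, const FlatMatrix& m) {
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path);
    out.write(reinterpret_cast<const char*>(&m.width), sizeof m.width);
    out.write(reinterpret_cast<const char*>(&m.height), sizeof m.height);
    out.write(reinterpret_cast<const char*>(m.values.data()),
              static_cast<std::streamsize>(m.values.size() * sizeof(float)));
    if (!out) throw std::runtime_error("write failed: " + path);
}

FlatMatrix loadBinary(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    FlatMatrix m;
    in.read(reinterpret_cast<char*>(&m.width), sizeof m.width);    // read 1
    in.read(reinterpret_cast<char*>(&m.height), sizeof m.height);  // read 2
    if (!in) throw std::runtime_error("truncated header: " + path);
    m.values.resize(static_cast<std::size_t>(m.width) * m.height);
    in.read(reinterpret_cast<char*>(m.values.data()),              // read 3
            static_cast<std::streamsize>(m.values.size() * sizeof(float)));
    if (!in) throw std::runtime_error("truncated values: " + path);
    return m;
}
```

No text parsing happens at all, so the load time is dominated by how fast the bytes can be pulled from disk.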
Just for fun, I measured the program posted above (using a 20,000x20,000 ASCII input file, as described) on my Mac Mini (3.2GHz i7 with SSD drive) and found that it took about 102 seconds to parse in the file using the posted code.
Then I wrote a version of the same function that uses the C stdio API (fopen()/fread()/fclose()) and does character-by-character parsing into a 1D float array. This implementation takes about 13 seconds to parse in the file on the same hardware, so it's about 7 times faster.
Both programs were compiled with g++ -O3 test_read_matrix.cpp.
float* faster_read_matrix(string fname, int numRows, int numCols)
{
FILE * fpIn = fopen(fname.c_str(), "r");
if (fpIn == NULL)
{
printf("Couldn't open file [%s] for input!\n", fname.c_str());
return NULL;
}
float* K = new float[numRows*numCols];
// We'll hold the current number in (numberBuf) until we're ready to parse it
char numberBuf[128] = {'\0'};
int numCharsInBuffer = 0;
int curRow = 0, curCol = 0;
while(curRow < numRows)
{
char tempBuf[4*1024]; // an arbitrary size
const size_t bytesRead = fread(tempBuf, 1, sizeof(tempBuf), fpIn);
if (bytesRead == 0)
{
// fread() returns a size_t, so it can never be negative; ask ferror() instead
if (ferror(fpIn)) perror("fread");
break;
}
for (size_t i=0; i<bytesRead; i++)
{
const char c = tempBuf[i];
if ((c=='.')||(c=='+')||(c=='-')||(isdigit(c)))
{
if ((numCharsInBuffer+1) < sizeof(numberBuf)) numberBuf[numCharsInBuffer++] = c;
else
{
printf("Error, number string was too long for numberBuf!\n");
}
}
else
{
if (numCharsInBuffer > 0)
{
// Parse the current number-chars we have assembled into (numberBuf) and reset (numberBuf) to empty
numberBuf[numCharsInBuffer] = '\0';
if (curCol < numCols) K[curRow*numCols+curCol] = strtod(numberBuf, NULL);
else
{
printf("Error, too many values in row %i! (Expected %i, found at least %i)\n", curRow, numCols, curCol);
}
curCol++;
}
numCharsInBuffer = 0;
if (c == '\n')
{
curRow++;
curCol = 0;
if (curRow >= numRows) break;
}
}
}
}
fclose(fpIn);
if (curRow != numRows) printf("Warning: I read %i lines in the file, but I expected there would be %i!\n", curRow, numRows);
return K;
}
I am dissatisfied with Jeremy Friesner’s otherwise excellent answer because it:
blames C++'s I/O system for the problem (which is not where the problem lies)
fixes the problem by circumventing the actual I/O problem without being explicit about how it is a significant contributor to speed
modifies memory accesses which (may or may not) contribute to speed, and does so in a way that very large matrices may not be supported
The reason his code runs so much faster is because he removes the single most important bottleneck: unoptimized disk access. JWO’s original code can be brought to match with three extra lines of code:
float** read_matrix(std::string fname, int dim_1, int dim_2){
float** K = make_matrix(dim_1, dim_2, 0);
const std::size_t buffer_size = 4*1024; // 1
char buffer[buffer_size]; // 2
std::ifstream ifile(fname);
ifile.rdbuf()->pubsetbuf(buffer, buffer_size); // 3
for (int i = 0; i < dim_1; ++i) {
for (int j = 0; j < dim_2; ++j) {
ifile >> K[i][j];
}
}
// ifile.clear();
// ifile.seekg(0, std::ios::beg);
return K;
}
The addition exactly replicates Friesner’s design, but using the C++ library capabilities without all the extra programming grief on our end.
You’ll notice I also removed a couple lines at the bottom that should be inconsequential to program function and correctness, but which may cause a minor cumulative time issue as well. (If they are not inconsequential, that is a bug and should be fixed!)
How much difference this all makes depends entirely on the quality of the C++ Standard Library implementation. AFAIK the big three modern C++ compilers (MSVC, GCC, and Clang) all have sufficiently-optimized I/O handling to make the issue moot.
locale
One other thing that may also make a difference is to .imbue() the stream with the default "C" locale, which avoids a lot of special handling for numbers in locale-dependent formats other than what your files use. You only need to bother to do this if you have changed your global locale, though.
ifile.imbue(std::locale::classic()); // the "C" locale; std::locale("") would select the user's locale instead
redundant initialization
Another thing that is killing your time is the effort to zero-initialize the array when you create it. Don’t do that if you don’t need it! (You don’t need it here because you know the total extents and will fill them properly. Since C++11, a failed extraction also writes zero to its target, but once the stream is already in a failed state, later extractions leave their targets untouched, so check the stream state after reading rather than relying on zeros.)
dynamic memory block size
Finally, accessing memory through an array of arrays should not significantly affect speed, but it still might be worth testing if you can change it. This assumes that the resulting matrix will never be too large for the memory manager to return as a single block (in which case the allocation fails and your program crashes).
A common design is to allocate the entire array as a single block, with the requested size plus size for the array of pointers to the rest of the block. This allows you to delete the array in a single delete[] statement. Again, I don’t believe this should be an optimization issue you need to care about until your profiler says so.
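To make the single-block design concrete, here is a minimal sketch. It is my own illustration rather than code from the question: the names make_matrix_block and free_matrix_block are hypothetical, and it uses std::malloc/std::free instead of new[]/delete[] so that one raw block can hold both the row pointers and the floats.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// One allocation: first nrow row pointers, then nrow * ncol floats.
// The caller still writes M[i][j], exactly as with an array of arrays.
float** make_matrix_block(std::size_t nrow, std::size_t ncol) {
    const std::size_t header = nrow * sizeof(float*);
    const std::size_t total = header + nrow * ncol * sizeof(float);
    void* raw = std::malloc(total != 0 ? total : 1);
    if (raw == nullptr) throw std::bad_alloc();
    float** rows = static_cast<float**>(raw);
    // The float area starts right after the pointer table; its offset is a
    // multiple of sizeof(float*), so it is suitably aligned for float.
    float* cells = reinterpret_cast<float*>(static_cast<char*>(raw) + header);
    for (std::size_t i = 0; i < nrow; ++i) {
        rows[i] = cells + i * ncol;
    }
    return rows;  // values are left uninitialized on purpose
}

// A single call releases both the pointer table and the values.
void free_matrix_block(float** m) { std::free(m); }
```

Because the values are contiguous, the whole float area can also be filled by one bulk read if the file is ever switched to a binary format.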
At the risk of this answer being considered incomplete (no code examples), I would like to add some further options for tackling the problem:
Use a binary format (width, height, values...) as the file format and then use file mapping (MapViewOfFile() on Windows, mmap() on POSIX/Unix systems).
Then you can simply point your "matrix structure" pointer at the mapped address space and you are done. If you access the matrix sparsely, it can even save some real I/O. If you always access all elements of the matrix (no sparse matrices etc.), it is still quite elegant and probably faster than malloc/read.
Use a replacement for C++ iostream, which is known to be quite slow and should not be used for performance-critical work:
Have a look at the {fmt} library, which has become quite popular in recent years and claims to be quite fast.
Back in the days when I did a lot of numerics on large data sets, I always opted for binary files for storage. (That was when the fastest CPU you could get your hands on was the Pentium 1, with the floating point bug :).) Back then, everything was slower, memory was much more limited (we had MB, not GB, as units for RAM in our systems), and nearly 20 years have passed since.
So, as a refresher, I wrote some code to show how much faster than iostream and text files you can be if you do not have extra constraints (such as the endianness of different CPUs etc.).
So far, my little test only has an iostream and a binary file version with a) stdio fread() kind of loading and b) mmap(). Since I sit in front of a debian bullseye computer, my code uses linux specific stuff for the mmap() approach. To run it on Windows, you have to change a few lines of code and some includes.
Edit: I added a save function using {fmt} now as well.
Edit: I added a load function with stdio now as well.
Edit: To reduce memory workload, I reordered the code somewhat
and now only keep 2 matrix instances in memory at any given time.
The program does the following:
create a 20k x 20k matrix in ram (in a struct named Matrix_t). With random values, slowly generated by std::random.
Write the matrix with iostream to a text file.
Write the matrix with stdio to a binary file.
Create a new matrix textMatrix by loading its data from the text file.
Create a new matrix inMemoryMatrix by loading its data from the binary file with a few fread() calls.
mmap() the binary file and use it under the name mappedMatrix.
Compare each of the loaded matrices to the original randomMatrix to see if the round-trip worked.
Here are the results I got on my machine after compiling this work of wonder with clang++ -O3 -o fmatio fast-matrix-io.cpp -lfmt:
./fmatio
creating random matrix (20k x 20k) (27.0775seconds)
the first 10 floating values in randomMatrix are:
57970.2 -365700 -986079 44657.8 826968 -506928 668277 398241 -828176 394645
saveMatrixAsText_IOSTREAM()
saving matrix with iostream. (192.749seconds)
saveMatrixAsText_FMT(mat0_fmt.txt)
saving matrix with {fmt}. (34.4932seconds)
saveMatrixAsBinary()
saving matrix into a binary file. (30.7591seconds)
loadMatrixFromText_IOSTREAM()
loading matrix from text file with iostream. (102.074seconds)
randomMatrix == textMatrix
comparing randomMatrix with textMatrix. (0.125328seconds)
loadMatrixFromText_STDIO(mat0_fmt.txt)
loading matrix from text file with stdio. (71.2746seconds)
randomMatrix == textMatrix
comparing randomMatrix with textMatrix (stdio). (0.124684seconds)
loadMatrixFromBinary(mat0.bin)
loading matrix from binary file into memory. (0.495685seconds)
randomMatrix == inMemoryMatrix
comparing randomMatrix with inMemoryMatrix. (0.124206seconds)
mapMatrixFromBinaryFile(mat0.bin)
mapping a view to a matrix in a binary file. (4.5883e-05seconds)
randomMatrix == mappedMatrix
comparing randomMatrix with mappedMatrix. (0.158459seconds)
And here is the code:
#include <cinttypes>
#include <memory>
#include <random>
#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <chrono>
#include <limits>
#include <iomanip>
// includes for mmap()...
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
// includes for {fmt}...
#include <fmt/core.h>
#include <fmt/os.h>
struct StopWatch {
using Clock = std::chrono::high_resolution_clock;
using TimePoint =
std::chrono::time_point<Clock>;
using Duration =
std::chrono::duration<double>;
void start(const char* description) {
this->description = std::string(description);
tstart = Clock::now();
}
void stop() {
TimePoint tend = Clock::now();
Duration elapsed = tend - tstart;
std::cout << description << " (" << elapsed.count()
<< "seconds)" << std::endl;
}
TimePoint tstart;
std::string description;
};
struct Matrix_t {
uint32_t ncol;
uint32_t nrow;
float values[];
inline uint32_t to_index(uint32_t col, uint32_t row) const {
return ncol * row + col;
}
};
template <class Initializer>
Matrix_t *createMatrix
( uint32_t ncol,
uint32_t nrow,
Initializer initFn
) {
size_t nfloats = ncol*nrow;
size_t nbytes = UINTMAX_C(8) + nfloats * sizeof(float);
Matrix_t * result =
reinterpret_cast<Matrix_t*>(operator new(nbytes));
if (nullptr != result) {
result->ncol = ncol;
result->nrow = nrow;
for (uint32_t row = 0; row < nrow; row++) {
for (uint32_t col = 0; col < ncol; col++) {
result->values[result->to_index(col,row)] =
initFn(ncol,nrow,col,row);
}
}
}
return result;
}
void saveMatrixAsText_IOSTREAM(const char* filePath,
const Matrix_t* matrix) {
std::cout << "saveMatrixAsText_IOSTREAM()" << std::endl;
if (nullptr == matrix) {
std::cout << "cannot save matrix - no matrix!" << std::endl;
return; // nothing to write; avoid dereferencing a null pointer below
}
std::ofstream outFile(filePath);
if (outFile) {
outFile << matrix->ncol << " " << matrix->nrow << std::endl;
const auto defaultPrecision = outFile.precision();
outFile.precision
(std::numeric_limits<float>::max_digits10);
for (uint32_t row = 0; row < matrix->nrow; row++) {
for (uint32_t col = 0; col < matrix->ncol; col++) {
outFile << matrix->values[matrix->to_index(col,row)]
<< " ";
}
outFile << std::endl;
}
} else {
std::cout << "could not open " << filePath << " for writing."
<< std::endl;
}
}
void saveMatrixAsText_FMT(const char* filePath,
const Matrix_t* matrix) {
std::cout << "saveMatrixAsText_FMT(" << filePath << ")"
<< std::endl;
if (nullptr == matrix) {
std::cout << "cannot save matrix - no matrix!" << std::endl;
return; // nothing to write; avoid dereferencing a null pointer below
}
auto outFile = fmt::output_file(filePath);
outFile.print("{} {}\n", matrix->ncol, matrix->nrow);
for (uint32_t row = 0; row < matrix->nrow; row++) {
outFile.print("{}", matrix->values[matrix->to_index(0,row)]);
for (uint32_t col = 1; col < matrix->ncol; col++) {
outFile.print(" {}",
matrix->values[matrix->to_index(col,row)]);
}
outFile.print("\n");
}
}
void saveMatrixAsBinary(const char* filePath,
const Matrix_t* matrix) {
std::cout << "saveMatrixAsBinary()" << std::endl;
FILE * outFile = fopen(filePath, "wb");
if (nullptr != outFile) {
fwrite( &matrix->ncol, 4, 1, outFile);
fwrite( &matrix->nrow, 4, 1, outFile);
size_t nfloats = matrix->ncol * matrix->nrow;
fwrite( &matrix->values, sizeof(float), nfloats, outFile);
fclose(outFile);
} else {
std::cout << "could not open " << filePath << " for writing."
<< std::endl;
}
}
Matrix_t* loadMatrixFromText_IOSTREAM(const char* filePath) {
std::cout << "loadMatrixFromText_IOSTREAM()" << std::endl;
std::ifstream inFile(filePath);
if (inFile) {
uint32_t ncol;
uint32_t nrow;
inFile >> ncol;
inFile >> nrow;
uint32_t nfloats = ncol * nrow;
auto loader =
[&inFile]
(uint32_t , uint32_t , uint32_t , uint32_t )
-> float
{
float value;
inFile >> value;
return value;
};
Matrix_t * matrix = createMatrix( ncol, nrow, loader);
return matrix;
} else {
std::cout << "could not open " << filePath << " for reading."
<< std::endl;
}
return nullptr;
}
Matrix_t* loadMatrixFromText_STDIO(const char* filePath) {
std::cout << "loadMatrixFromText_STDIO(" << filePath << ")"
<< std::endl;
Matrix_t* matrix = nullptr;
FILE * inFile = fopen(filePath, "rt");
if (nullptr != inFile) {
uint32_t ncol;
uint32_t nrow;
fscanf(inFile, "%" SCNu32 " %" SCNu32, &ncol, &nrow); // SCNu32 (from <cinttypes>) matches uint32_t
auto loader =
[&inFile]
(uint32_t , uint32_t , uint32_t , uint32_t )
-> float
{
float value;
fscanf(inFile, "%f", &value);
return value;
};
matrix = createMatrix( ncol, nrow, loader);
fclose(inFile);
} else {
std::cout << "could not open " << filePath << " for reading."
<< std::endl;
}
return matrix;
}
Matrix_t* loadMatrixFromBinary(const char* filePath) {
std::cout << "loadMatrixFromBinary(" << filePath << ")"
<< std::endl;
FILE * inFile = fopen(filePath, "rb");
if (nullptr != inFile) {
uint32_t ncol;
uint32_t nrow;
fread( &ncol, 4, 1, inFile);
fread( &nrow, 4, 1, inFile);
uint32_t nfloats = ncol * nrow;
uint32_t nbytes = nfloats * sizeof(float) + UINT32_C(8);
Matrix_t* matrix =
reinterpret_cast<Matrix_t*>
(operator new (nbytes));
if (nullptr != matrix) {
matrix->ncol = ncol;
matrix->nrow = nrow;
fread( &matrix->values[0], sizeof(float), nfloats, inFile);
fclose(inFile); // close the file on the success path too
return matrix;
} else {
std::cout << "could not find memory for the matrix."
<< std::endl;
}
fclose(inFile);
} else {
std::cout << "could not open file "
<< filePath << " for reading." << std::endl;
}
return nullptr;
}
void freeMatrix(Matrix_t* matrix) {
operator delete(matrix);
}
Matrix_t* mapMatrixFromBinaryFile(const char* filePath) {
std::cout << "mapMatrixFromBinaryFile(" << filePath << ")"
<< std::endl;
Matrix_t * matrix = nullptr;
int fd = open( filePath, O_RDONLY);
if (-1 != fd) {
struct stat sb;
if (-1 != fstat(fd, &sb)) {
auto fileSize = sb.st_size;
matrix =
reinterpret_cast<Matrix_t*>
(mmap(nullptr, fileSize, PROT_READ, MAP_PRIVATE, fd, 0));
if (MAP_FAILED == matrix) { // mmap() reports failure with MAP_FAILED, not nullptr
std::cout << "mmap() failed!" << std::endl;
matrix = nullptr;
}
} else {
std::cout << "fstat() failed!" << std::endl;
}
close(fd);
} else {
std::cout << "open() failed!" << std::endl;
}
return matrix;
}
void unmapMatrix(Matrix_t* matrix) {
if (nullptr == matrix)
return;
size_t nbytes =
UINTMAX_C(8) +
sizeof(float) * matrix->ncol * matrix->nrow;
munmap(matrix, nbytes);
}
bool areMatricesEqual( const Matrix_t* m1, const Matrix_t* m2) {
if (nullptr == m1) return false;
if (nullptr == m2) return false;
if (m1->ncol != m2->ncol) return false;
if (m1->nrow != m2->nrow) return false;
// both exist and have same size...
size_t nfloats = m1->ncol * m1->nrow;
size_t nbytes = nfloats * sizeof(float);
return 0 == memcmp( m1->values, m2->values, nbytes);
}
int main(int argc, const char* argv[]) {
std::random_device rdev;
std::default_random_engine reng(rdev());
std::uniform_real_distribution<> rdist(-1.0E6F, 1.0E6F);
StopWatch sw;
auto randomInitFunction =
[&reng,&rdist]
(uint32_t ncol, uint32_t nrow, uint32_t col, uint32_t row)
-> float
{
return rdist(reng);
};
sw.start("creating random matrix (20k x 20k)");
Matrix_t * randomMatrix =
createMatrix(UINT32_C(20000),
UINT32_C(20000),
randomInitFunction);
sw.stop();
if (nullptr != randomMatrix) {
std::cout
<< "the first 10 floating values in randomMatrix are: "
<< std::endl;
std::cout << randomMatrix->values[0];
for (size_t i = 1; i < 10; i++) {
std::cout << " " << randomMatrix->values[i];
}
std::cout << std::endl;
sw.start("saving matrix with iostream.");
saveMatrixAsText_IOSTREAM("mat0_iostream.txt", randomMatrix);
sw.stop();
sw.start("saving matrix with {fmt}.");
saveMatrixAsText_FMT("mat0_fmt.txt", randomMatrix);
sw.stop();
sw.start("saving matrix into a binary file.");
saveMatrixAsBinary("mat0.bin", randomMatrix);
sw.stop();
sw.start("loading matrix from text file with iostream.");
Matrix_t* textMatrix =
loadMatrixFromText_IOSTREAM("mat0_iostream.txt");
sw.stop();
sw.start("comparing randomMatrix with textMatrix.");
if (!areMatricesEqual(randomMatrix, textMatrix)) {
std::cout << "randomMatrix != textMatrix!" << std::endl;
} else {
std::cout << "randomMatrix == textMatrix" << std::endl;
}
sw.stop();
freeMatrix(textMatrix);
textMatrix = nullptr;
sw.start("loading matrix from text file with stdio.");
textMatrix =
loadMatrixFromText_STDIO("mat0_fmt.txt");
sw.stop();
sw.start("comparing randomMatrix with textMatrix (stdio).");
if (!areMatricesEqual(randomMatrix, textMatrix)) {
std::cout << "randomMatrix != textMatrix!" << std::endl;
} else {
std::cout << "randomMatrix == textMatrix" << std::endl;
}
sw.stop();
freeMatrix(textMatrix);
textMatrix = nullptr;
sw.start("loading matrix from binary file into memory.");
Matrix_t* inMemoryMatrix =
loadMatrixFromBinary("mat0.bin");
sw.stop();
sw.start("comparing randomMatrix with inMemoryMatrix.");
if (!areMatricesEqual(randomMatrix, inMemoryMatrix)) {
std::cout << "randomMatrix != inMemoryMatrix!"
<< std::endl;
} else {
std::cout << "randomMatrix == inMemoryMatrix" << std::endl;
}
sw.stop();
freeMatrix(inMemoryMatrix);
inMemoryMatrix = nullptr;
sw.start("mapping a view to a matrix in a binary file.");
Matrix_t* mappedMatrix =
mapMatrixFromBinaryFile("mat0.bin");
sw.stop();
sw.start("comparing randomMatrix with mappedMatrix.");
if (!areMatricesEqual(randomMatrix, mappedMatrix)) {
std::cout << "randomMatrix != mappedMatrix!"
<< std::endl;
} else {
std::cout << "randomMatrix == mappedMatrix" << std::endl;
}
sw.stop();
unmapMatrix(mappedMatrix);
mappedMatrix = nullptr;
freeMatrix(randomMatrix);
} else {
std::cout << "could not create random matrix!" << std::endl;
}
return 0;
}
Please note, that binary formats where you simply cast to a struct pointer also depend on how the compiler does alignment and padding within structures. In my case, I was lucky and it worked. On other systems, you might have to tweak a little (#pragma pack(4) or something along that line) to make it work.
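One hedged way to find out early whether the "cast to a struct pointer" trick is safe on a given compiler is to let the compiler check the layout. The sketch below uses a separate, hypothetical MatrixHeader struct that mirrors the two leading fields of Matrix_t; the 8-byte header and 4-byte float are the assumptions the file format above relies on.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical mirror of the two header fields written to the binary file.
struct MatrixHeader {
    std::uint32_t ncol;
    std::uint32_t nrow;
};

// Compilation fails if the in-memory layout differs from the file layout.
static_assert(std::is_trivially_copyable<MatrixHeader>::value,
              "header must be safe to copy byte by byte");
static_assert(sizeof(MatrixHeader) == 8, "unexpected padding in header");
static_assert(offsetof(MatrixHeader, nrow) == 4, "nrow must follow ncol directly");
static_assert(sizeof(float) == 4, "file format assumes 4-byte floats");

// Byte order cannot be checked by layout alone; this probes it at run time.
bool hostIsLittleEndian() {
    const std::uint32_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}
```

If one of the static_asserts fires, that is the point where #pragma pack or an explicit field-by-field read and write would be needed.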

C++ Declaring arrays in class and declaring 2d arrays in class

I'm new to using classes and I ran into a problem while declaring an array in a class. I want to declare a char array for text limited to 50 characters and then replace the text with a function.
#ifndef MAP_H
#define MAP_H
#include "Sprite.h"
#include <SFML/Graphics.hpp>
#include <iostream>
class Map : public sprite
{
private:
char mapname[50];
int columnnumber;
int linenumber;
char casestatematricia[];
public:
void setmapname(char newmapname[50]);
void battlespace(int column, int line);
void setcasevalue(int col, int line, char value);
void printcasematricia();
};
#endif
By the way I could initialize my 2d array like that
char casestatematricia[][];
I want later to make this 2d array dynamic where I enter a column number and a line number like that
casestatematricia[linenumber][columnnumber]
to create a battlefield.
This is the cpp code, so that you have an idea of what I want to do.
#include "Map.h"
#include <SFML/Graphics.hpp>
#include <iostream>
using namespace sf;
void Map::setmapname(char newmapname[50])
{
this->mapname = newmapname;
}
void Map::battlespace(int column, int line)
{
}
void Map::setcasevalue(int col, int line, char value)
{
}
void Map::printcasematricia()
{
}
Thank you in advance.
Consider following common practice on this one.
Most (e.g. numerical) libraries don't use 2D arrays inside classes.
They use dynamically allocated 1D arrays and overload the () or [] operator to access the right elements in a 2D-like fashion.
So on the outside you never can tell that you're actually dealing with consecutive storage, it looks like a 2D array.
In this way arrays are easier to resize, more efficient to store, transpose and reshape.
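As a small illustration of that practice (a sketch with made-up names, not taken from any particular library), a grid class can keep one std::vector<char> and expose operator() for 2D-style access:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical 2D grid backed by a single contiguous 1D vector.
class Grid {
public:
    Grid(std::size_t rows, std::size_t cols, char fill = 0)
        : rows_(rows), cols_(cols), cells_(rows * cols, fill) {}

    // grid(row, col) reads like 2D indexing but maps to one flat index.
    char& operator()(std::size_t row, std::size_t col) {
        return cells_[index(row, col)];
    }
    char operator()(std::size_t row, std::size_t col) const {
        return cells_[index(row, col)];
    }

    // Resizing is a single reassignment of the flat storage.
    void resize(std::size_t rows, std::size_t cols, char fill = 0) {
        rows_ = rows;
        cols_ = cols;
        cells_.assign(rows * cols, fill);
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t index(std::size_t row, std::size_t col) const {
        if (row >= rows_ || col >= cols_) throw std::out_of_range("Grid index");
        return row * cols_ + col;  // row-major layout
    }

    std::size_t rows_;
    std::size_t cols_;
    std::vector<char> cells_;
};
```

For the battlefield in the question, the line and column numbers entered at run time would simply be passed to the constructor or to resize().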
Just a proposition for your problem:
class Map : public sprite
{
private:
std::string mapname;
int columnnumber;
int linenumber;
std::vector<char> casestatematricia;
static constexpr std::size_t maxRow = 50;
static constexpr std::size_t maxCol = 50;
public:
Map():
casestatematricia(maxRow * maxCol, 0)
{}
void setmapname(std::string newmapname)
{
if (newmapname.size() > 50)
{
// Handle the error if the name really must not exceed 50 characters...
// Or just truncate when you serialize!
}
mapname = newmapname;
}
void battlespace(int col, int row);
void setcasevalue(int col, int row, char value)
{
// check that col and row are between 0 and max{Col|Row} - 1
casestatematricia[row * maxCol + col] = value; // row-major: each row holds maxCol cells
}
void printcasematricia()
{
for (std::size_t row = 0; row < maxRow; ++row)
{
for (std::size_t col = 0; col < maxCol; ++col)
{
char currentCell = casestatematricia[row * maxCol + col];
std::cout << currentCell; // requires <iostream>
}
std::cout << '\n'; // one output line per row
}
}
};
To access a 1D array like a 2D array, take a look at Access a 1D array as a 2D array in C++.
When you think about serialization, I guess you want to save it to a file. A word of advice: don't store raw memory to a file just to "save" time when you relaunch your software. You just end up with a non-portable solution! And seriously, with the power of today's computers, you don't have to worry about the time it takes to load from a file!
I suggest adding two methods to your class to save the Map to a file and load it back:
void dump(std::ostream &os)
{
os << mapname << "\n";
std::size_t currentCol = 0;
for(auto c: casestatematricia)
{
os << static_cast<int>(c) << " ";
++currentCol;
if (currentCol >= maxCol) // end of a row: start a new line
{
currentCol = 0;
os << "\n";
}
}
}
void load(std::istream &is)
{
std::string line;
std::getline(is, line);
mapname = line;
std::size_t current_cell = 0;
while(std::getline(is, line))
{
std::istringstream row(line); // requires <sstream>
int value; // dump() writes every cell as an integer
while(row >> value && current_cell < casestatematricia.size())
{
casestatematricia[current_cell] = static_cast<char>(value);
++current_cell;
}
}
}
This solution is only given as an example. It doesn't handle errors, and I have chosen to store the data as ASCII in the file. You can change it to store binary, but don't write raw memory directly. You can take a look at C - serialization techniques (you just have to translate it to C++). But please, don't use memcpy or a similar technique to serialize.
I hope I get this right. You have two questions. You want to know how to assign the value of char mapname[50]; via void setmapname(char newmapname[50]);. And you want to know how to create a dynamically sized 2D array.
I hope you are comfortable with pointers, because you need them in both cases.
For the first question, I would like to first correct your understanding of void setmapname(char newmapname[50]);. C++ functions do not take arrays by value. They take a pointer to the array's first element. So it is as good as writing void setmapname(char *newmapname);. For better understanding, go to Passing Arrays to Function in C++
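A tiny compile-time check can make the point about array parameters visible. This is only an illustrative sketch; the function names takesArray, takesPointer and arrayLength are invented here.

```cpp
#include <cstddef>
#include <type_traits>

void takesArray(char name[50]);  // the 50 is ignored by the compiler
void takesPointer(char* name);

// Both declarations have exactly the same function type, void(char*).
constexpr bool sameSignature =
    std::is_same<decltype(takesArray), decltype(takesPointer)>::value;
static_assert(sameSignature, "an array parameter is adjusted to a pointer");

// Taking the array by reference is one way to keep its size in the type.
template <std::size_t N>
constexpr std::size_t arrayLength(const char (&)[N]) {
    return N;
}
```

Because the size is lost in the plain parameter form, the callee cannot know how many characters it may read, which is why a length has to be passed separately.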
With that, I am going to change the function to also take the length of the new map name. To assign mapname, just use a loop to copy each char.
void setmapname(char *newmapname, int length) {
// ensure that the string passed in is not longer than
// what mapname can hold, leaving room for the terminator.
if (length < 0) length = 0;
length = length < 49 ? length : 49;
// loop over each value and assign one by one.
for(int i = 0; i < length; ++i) {
mapname[i] = newmapname[i];
}
mapname[length] = '\0'; // keep mapname a valid C string
}
For the second question, you can use a vector as proposed by Garf365, but I prefer to just use a pointer, and I will use a 1D array to represent the 2D battlefield. (You can read the link Garf365 provided.)
// Declare like this
char *casestatematricia; // remember to initialize this to 0.
// Create the battlefield
void Map::battlespace(int column, int line) {
columnnumber = column;
linenumber = line;
// Clear the previous battlefield.
clearspace();
// Creating the battlefield
casestatematricia = new char[column * line];
// initialise casestatematricia...
}
// Call this after you done using the battlefield
void Map::clearspace() {
if (!casestatematricia) return;
delete [] casestatematricia;
casestatematricia = 0;
}
Just remember to call clearspace() when you are no longer using it.
Just for your benefit, this is how you create a dynamic size 2D array
// Declare like this
char **casestatematricia; // remember to initialize this to 0.
// Create the battlefield
void Map::battlespace(int column, int line) {
// Clear the previous battlefield first, while columnnumber
// still holds the old size that clearspace() loops over.
clearspace();
columnnumber = column;
linenumber = line;
// Creating the battlefield
casestatematricia = new char*[column];
for (int i = 0; i < column; ++i) {
casestatematricia[i] = new char[line];
}
// initialise casestatematricia...
}
// Call this after you done using the battlefield
void Map::clearspace() {
if (!casestatematricia) return;
for(int i = 0; i < columnnumber; ++i) {
delete [] casestatematricia[i];
}
delete [] casestatematricia;
casestatematricia = 0;
}
Hope this helps.
PS: If you need to serialize the string, you can use a Pascal-style (length-prefixed) string format so that you can support strings of variable length, e.g. "11hello world" or "3foo".

Runtime Error in HDF5 file manipulation

I was trying a program where I convert an array of structures to a byte array and then save it to an HDF5 dataset multiple times. (The dataset has a dimension of 100, so I'll do the write operation 100 times.) I don't have any problems converting the structure to a byte array; I seem to run into a problem when I try to select the hyperslab where I need to write data in the dataset. I am new to HDF5. Please help me with this problem.
#include "stdafx.h"
#include "h5cpp.h"
#include <iostream>
#include <conio.h>
#include <string>
#ifndef H5_NO_NAMESPACE
using namespace H5;
#endif
using std::cout;
using std::cin;
using std::string;
const H5std_string fName( "dset.h5" );
const H5std_string dsName( "dset" );
struct MyStruct
{
int x[1000],y[1000];
double z[1000];
};
int main()
{
try
{
MyStruct obj[10];
char* totalData;
char* inData;
hsize_t offset[1],count[1];
H5File file("sample.h5", H5F_ACC_TRUNC);
StrType type(PredType::C_S1,100*sizeof(obj));
Group *myGroup = new Group(file.createGroup("\\myGroup"));
hsize_t dim[] = {100};
DataSpace dSpace(1,dim);
DataSet dSet = myGroup->createDataSet("dSet", type, dSpace);
for(int m = 0; m < 100 ; m++)
{
for(int j = 0 ; j < 10 ; j++)
{
for(int i = 0 ; i < 1000 ; i++) // some random values stored
{
obj[j].x[i] = i*13 + i*19;
obj[j].y[i] = i*37 - i*18;
obj[j].z[i] = (i + 1) / (0.4 * i);
}
}
totalData = new char[sizeof(obj)]; // converting struct to byte array
memcpy(totalData, &obj, sizeof(obj));
cout<<"Start Write.\n";
cout<<"Total Size : "<<sizeof(obj)/1000<<"KB\n";
//Exception::dontPrint();
hsize_t dim[] = { 1 }; //I think am screwing up between this line and following 5 lines
DataSpace memSpace(1, dim);
offset[0] = m;
count[0] = 1;
dSpace.selectHyperslab(H5S_SELECT_SET, count, offset);
dSet.write(totalData, type, memSpace, dSpace);
cout<<"Write Done.\n";
cout<<"Read Start.\n";
inData = new char[sizeof(obj)];
dSet.read(inData, type);
cout<<"Read Done\n";
}
delete myGroup;
}
catch(Exception e)
{
e.printError();
}
_getch();
return 0;
}
The Output I get is,
And when I use H5S_SELECT_APPEND instead of H5S_SELECT_SET, the output says
Start Write.
Total Size : 160KB
HDF5-DIAG: Error detected in HDF5 (1.8.12) thread 0:
#000: ..\..\src\H5Shyper.c line 6611 in H5Sselect_hyperslab(): unable to set hyperslab selection
major: Dataspace
minor: Unable to initialize object
#001: ..\..\src\H5Shyper.c line 6477 in H5S_select_hyperslab(): invalid selection operation
major: Invalid arguments to routine
minor: Feature is unsupported
Please, help me with this situation. Thanks in advance..
The main problem is the size of your type datatype. It should be sizeof(obj) and not 100*sizeof(obj).
And anyway, you shouldn't be using a string datatype but an opaque datatype since that's what it is, so you can replace this whole line by:
DataType type(H5T_OPAQUE, sizeof(obj));
The second problem is in the read. Either you read everything and you need to make sure inData is big enough, that is 100*sizeof(obj) instead of sizeof(obj), or you need to select just the element you want to read just like for the write.

Efficient Computation of Frequent and Top-k Elements in Data Streams

Here is the pseudocode for this algorithm.
Following is how I have implemented this.
#include <iostream>
#include <fstream>
#include <string>
#include <map>
typedef std::map<std::string, int> collection_t;
typedef collection_t::iterator collection_itr_t;
collection_t T;
collection_itr_t get_smallest_key() {
collection_itr_t min_key = T.begin();
collection_itr_t key = ++min_key;
while ( key != T.end() ) {
if ( key->second < min_key->second )
min_key = key;
++key;
}
return min_key;
}
void space_saving_frequent( std::string &i, int k ) {
if ( T.find(i) != T.end())
T[i]++;
else if ( T.size() < k ) {
T.insert(std::make_pair(i, 1 ));
} else {
collection_itr_t j = get_smallest_key();
int cnt = j->second + 1;
T.erase(j);
T.insert(std::make_pair(i, cnt));
}
}
int main ( int argc, char **argv) {
std::ifstream ifs(argv[1]);
if ( ifs.peek() == EOF )
return 1;
std::string line;
while( std::getline(ifs,line) ) {
std::string::size_type left = line.rfind('=') + 1;
std::string::size_type length = line.length();
std::string i = line.substr(left, length - left - 1);
space_saving_frequent(i, 5);
}
ifs.close();
return 0;
}
Original paper link : http://dimacs.rutgers.edu/~graham/pubs/papers/freqcacm.pdf
But the code does not work, and I am not able to figure out where I went wrong.
If the items with least count are two or more, you can simply break ties arbitrarily by choosing, for instance, the item with lowest index stored in your data structure, or a random one among those of lowest count etc.
If you want to compare your implementation with a reference one, take a look at the implementation of Cormode and Hadjieleftheriou that you will find here. The code is more complex than yours, because you are not actually implementing the stream summary data structure. Their code also includes implementations of several other frequent-items algorithms, and the authors compared the performance of those algorithms. In the majority of cases, Space-Saving proved to be the best algorithm with regard to several metrics such as precision, recall, update speed, and space used. You will also find a paper discussing this experimental comparison. An improved version of this paper appeared later in Communications of the ACM. Here you can access a pdf version.
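For comparison with the code in the question, here is a minimal sketch of the Space-Saving update step that keeps the same std::map and linear-scan approach (it does not implement the stream summary structure). The class name SpaceSaving and its members offer, count and smallest are hypothetical names chosen for this illustration. One difference worth noting: in the question, `collection_itr_t key = ++min_key;` advances min_key itself, so the first element is never considered as the minimum; the sketch starts the scan from a separate iterator. Ties are broken by taking the first minimum found in key order.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper for illustration; not taken from the reference code.
class SpaceSaving {
public:
    // k is the number of counters to keep; it must be positive.
    explicit SpaceSaving(std::size_t k) : capacity_(k) { assert(k > 0); }

    // Process one item of the stream.
    void offer(const std::string& item) {
        std::map<std::string, int>::iterator hit = counts_.find(item);
        if (hit != counts_.end()) {
            ++hit->second;                            // item already monitored
        } else if (counts_.size() < capacity_) {
            counts_.insert(std::make_pair(item, 1));  // a counter is still free
        } else {
            // Replace the item with the smallest count; the newcomer
            // inherits that count plus one.
            std::map<std::string, int>::iterator victim = smallest();
            const int inherited = victim->second + 1;
            counts_.erase(victim);
            counts_.insert(std::make_pair(item, inherited));
        }
    }

    // Estimated count of item, or 0 if it is not monitored.
    int count(const std::string& item) const {
        std::map<std::string, int>::const_iterator hit = counts_.find(item);
        return hit == counts_.end() ? 0 : hit->second;
    }

    std::size_t size() const { return counts_.size(); }

private:
    // Linear scan over all counters, starting at the first element.
    std::map<std::string, int>::iterator smallest() {
        std::map<std::string, int>::iterator best = counts_.begin();
        for (std::map<std::string, int>::iterator it = counts_.begin();
             it != counts_.end(); ++it) {
            if (it->second < best->second)
                best = it;
        }
        return best;
    }

    std::size_t capacity_;
    std::map<std::string, int> counts_;
};
```

The linear scan makes every eviction O(k), which is acceptable for small k such as the 5 used in the question; the stream summary structure mentioned above exists to avoid that scan.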

Efficient way to save objects into binary files

I've a class that consists basically of a matrix of vectors: vector< MyFeatVector<T> > m_vCells, where the outer vector represents the matrix. Each element in this matrix is then a vector (I extended the stl vector class and named it MyFeatVector<T>).
I'm trying to code an efficient method to store objects of this class in binary files.
Up to now, I require three nested loops:
foutput.write( reinterpret_cast<char*>( &(this->at(dy,dx,dz)) ), sizeof(T) );
where this->at(dy,dx,dz) retrieves the dz element of the vector at position [dy,dx].
Is there any possibility to store the m_vCells private member without using loops? I tried something like: foutput.write(reinterpret_cast<char*>(&(this->m_vCells[0])), (this->m_vCells.size())*sizeof(CFeatureVector<T>)); which seems not to work correctly. We can assume that all the vectors in this matrix have the same size, although a more general solution is also welcomed :-)
Furthermore, with my nested-loop implementation, storing objects of this class in binary files seems to require more physical space than storing the same objects in plain-text files, which is a bit weird.
I was trying to follow the suggestion under http://forum.allaboutcircuits.com/showthread.php?t=16465 but couldn't arrive into a proper solution.
Thanks!
Below a simplified example of my serialization and unserialization methods.
template < typename T >
bool MyFeatMatrix<T>::writeBinary( const string & ofile ){
ofstream foutput(ofile.c_str(), ios::out|ios::binary);
foutput.write(reinterpret_cast<char*>(&this->m_nHeight), sizeof(int));
foutput.write(reinterpret_cast<char*>(&this->m_nWidth), sizeof(int));
foutput.write(reinterpret_cast<char*>(&this->m_nDepth), sizeof(int));
//foutput.write(reinterpret_cast<char*>(&(this->m_vCells[0])), nSze*sizeof(CFeatureVector<T>));
for(register int dy=0; dy < this->m_nHeight; dy++){
for(register int dx=0; dx < this->m_nWidth; dx++){
for(register int dz=0; dz < this->m_nDepth; dz++){
foutput.write( reinterpret_cast<char*>( &(this->at(dy,dx,dz)) ), sizeof(T) );
}
}
}
foutput.close();
return true;
}
template < typename T >
bool MyFeatMatrix<T>::readBinary( const string & ifile ){
ifstream finput(ifile.c_str(), ios::in|ios::binary);
int nHeight, nWidth, nDepth;
finput.read(reinterpret_cast<char*>(&nHeight), sizeof(int));
finput.read(reinterpret_cast<char*>(&nWidth), sizeof(int));
finput.read(reinterpret_cast<char*>(&nDepth), sizeof(int));
this->resize(nHeight, nWidth, nDepth);
for(register int dy=0; dy < this->m_nHeight; dy++){
for(register int dx=0; dx < this->m_nWidth; dx++){
for(register int dz=0; dz < this->m_nDepth; dz++){
finput.read( reinterpret_cast<char*>( &(this->at(dy,dx,dz)) ), sizeof(T) );
}
}
}
finput.close();
return true;
}
The most efficient method is to store the objects into an array (or other contiguous space), then blast the buffer to the file. An advantage is that the disk platters don't have to waste time ramping up, and the writing can be performed contiguously instead of in random locations.
If this is your performance bottleneck, you may want to consider using multiple threads, with one extra thread to handle the output. Dump the objects into a buffer, set a flag, then the writing thread will handle the output, relieving your main task to perform more important work.
Edit 1: Serializing Example
The following code has not been compiled and is for illustrative purposes only.
#include <algorithm>
#include <cstddef>
#include <cstring> // strncpy
#include <fstream>
#include <string>
using std::ofstream;
using std::fill;
class binary_stream_interface
{
public:
virtual ~binary_stream_interface() {}
virtual void load_from_buffer(const unsigned char *& buf_ptr) = 0;
virtual size_t size_on_stream(void) const = 0;
virtual void store_to_buffer(unsigned char *& buf_ptr) const = 0;
};
struct Pet
: public binary_stream_interface
{
std::string name;
unsigned int age;
const unsigned int max_name_length;
// Members are initialized in declaration order: name, age, max_name_length.
Pet() : age(0), max_name_length(32) {}
void load_from_buffer(const unsigned char *& buf_ptr)
{
age = *((unsigned int *) buf_ptr);
buf_ptr += sizeof(unsigned int);
name = std::string((char *) buf_ptr);
buf_ptr += max_name_length;
return;
}
size_t size_on_stream(void) const
{
return sizeof(unsigned int) + max_name_length;
}
void store_to_buffer(unsigned char *& buf_ptr) const
{
*((unsigned int *) buf_ptr) = age;
buf_ptr += sizeof(unsigned int);
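// Zero the fixed-width name field, then copy at most max_name_length - 1
// characters so the field always ends with a '\0'.
std::fill(buf_ptr, buf_ptr + max_name_length, static_cast<unsigned char>(0));
strncpy((char *) buf_ptr, name.c_str(), max_name_length - 1);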
buf_ptr += max_name_length;
return;
}
};
int main(void)
{
Pet dog;
dog.name = "Fido";
dog.age = 5;
ofstream data_file("pet_data.bin", std::ios::binary);
// Determine size of buffer
size_t buffer_size = dog.size_on_stream();
// Allocate the buffer
unsigned char * buffer = new unsigned char [buffer_size];
unsigned char * buf_ptr = buffer;
// Write / store the object into the buffer.
dog.store_to_buffer(buf_ptr);
// Write the buffer to the file / stream.
data_file.write((char *) buffer, buffer_size);
data_file.close();
delete [] buffer;
return 0;
}
Edit 2: A class with a vector of strings
class Many_Strings
: public binary_stream_interface
{
public:
enum {MAX_STRING_SIZE = 32};
size_t size_on_stream(void) const
{
return m_string_container.size() * MAX_STRING_SIZE // Total size of strings.
+ sizeof(size_t); // with room for the quantity variable.
}
void store_to_buffer(unsigned char *& buf_ptr) const
{
// Treat the vector<string> as a variable length field.
// Store the quantity of strings into the buffer,
// followed by the content.
size_t string_quantity = m_string_container.size();
*((size_t *) buf_ptr) = string_quantity;
buf_ptr += sizeof(size_t);
for (size_t i = 0; i < string_quantity; ++i)
{
// Each string is a fixed length field.
// Pad with '\0' first, then copy the data.
std::fill(buf_ptr, buf_ptr + MAX_STRING_SIZE, static_cast<unsigned char>(0));
strncpy((char *) buf_ptr, m_string_container[i].c_str(), MAX_STRING_SIZE - 1);
buf_ptr += MAX_STRING_SIZE;
}
}
void load_from_buffer(const unsigned char *& buf_ptr)
{
// The actual coding is left as an exercise for the reader.
// Pseudocode:
// Clear / empty the string container.
// load the quantity variable.
// increment the buffer variable by the size of the quantity variable.
// for each new string (up to the quantity just read)
// load a temporary string from the buffer via buffer pointer.
// push the temporary string into the vector
// increment the buffer pointer by the MAX_STRING_SIZE.
// end-for
}
std::vector<std::string> m_string_container;
};
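The loading step left as an exercise above could look roughly like the following sketch. It uses free functions instead of the Many_Strings class so that it stands on its own; the names kMaxStringSize, stream_size, store_strings and load_strings are hypothetical. It assumes the same layout as the pseudocode: a size_t quantity followed by fixed-width, zero-padded string fields, written and read on the same machine (no endianness or size_t width conversion). The store function copies at most kMaxStringSize - 1 characters, so every field ends with a '\0' and longer strings are truncated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical constant mirroring MAX_STRING_SIZE in the class above.
const std::size_t kMaxStringSize = 32;

// Number of bytes the vector occupies in the buffer.
std::size_t stream_size(const std::vector<std::string>& strings) {
    return sizeof(std::size_t) + strings.size() * kMaxStringSize;
}

// Store the quantity, then each string as a zero-padded fixed-width field.
void store_strings(const std::vector<std::string>& strings,
                   unsigned char*& buf_ptr) {
    assert(buf_ptr != nullptr);
    const std::size_t quantity = strings.size();
    std::memcpy(buf_ptr, &quantity, sizeof(quantity));
    buf_ptr += sizeof(quantity);
    for (std::size_t i = 0; i < quantity; ++i) {
        std::memset(buf_ptr, 0, kMaxStringSize);
        // Copy at most kMaxStringSize - 1 characters to keep a terminator.
        std::strncpy(reinterpret_cast<char*>(buf_ptr), strings[i].c_str(),
                     kMaxStringSize - 1);
        buf_ptr += kMaxStringSize;
    }
}

// Inverse of store_strings: follows the pseudocode step by step.
void load_strings(std::vector<std::string>& strings,
                  const unsigned char*& buf_ptr) {
    assert(buf_ptr != nullptr);
    strings.clear();                                    // empty the container
    std::size_t quantity = 0;
    std::memcpy(&quantity, buf_ptr, sizeof(quantity));  // load the quantity
    buf_ptr += sizeof(quantity);
    strings.reserve(quantity);
    for (std::size_t i = 0; i < quantity; ++i) {
        // Each field is zero-terminated, so it can be read as a C string.
        strings.push_back(std::string(reinterpret_cast<const char*>(buf_ptr)));
        buf_ptr += kMaxStringSize;                      // skip the whole field
    }
}
```

memcpy is used for the quantity instead of a pointer cast so that the buffer does not need to be aligned for size_t.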
I'd suggest you read the C++ FAQ on Serialization and choose what best fits your needs.
When you're working with structures and classes, you have to take care of two things:
Pointers inside the class
Padding bytes
Both of these can produce nasty surprises in your output. IMO, the object itself should implement its own serialization and deserialization. The object knows its own structure, pointer data, etc., so it can decide which format can be implemented efficiently.
You will have to iterate anyway, or wrap the iteration somewhere. You can implement the serialization and deserialization functions either as operators or as ordinary functions. Especially when you're working with stream objects, overloading the << and >> operators makes it easy to pass the object.
Regarding your question about using the underlying pointer of the vector: it might work for a single vector of plain data. It is not a good idea for a vector of vectors, because each inner vector keeps its elements in a separate heap allocation, so writing the outer vector's storage dumps the inner vectors' internal pointers rather than their elements.
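As a rough illustration of the point about underlying pointers, the sketch below writes a vector of vectors with one write call per inner vector instead of one per element. The function names write_cells and read_cells are hypothetical, and the sketch assumes T is a trivially copyable type (such as int or float) and that the file is read back on the same platform that wrote it.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Write the outer size, then each inner vector as (size, raw elements).
// Assumes T is trivially copyable.
template <typename T>
void write_cells(std::ostream& out, const std::vector<std::vector<T> >& cells) {
    const std::size_t outer = cells.size();
    out.write(reinterpret_cast<const char*>(&outer), sizeof(outer));
    for (std::size_t i = 0; i < outer; ++i) {
        const std::size_t inner = cells[i].size();
        out.write(reinterpret_cast<const char*>(&inner), sizeof(inner));
        if (inner != 0) {
            // The elements of one inner vector are contiguous, so a single
            // write covers all of them.
            out.write(reinterpret_cast<const char*>(cells[i].data()),
                      static_cast<std::streamsize>(inner * sizeof(T)));
        }
    }
}

// Inverse of write_cells. Returns false if the stream ends early.
template <typename T>
bool read_cells(std::istream& in, std::vector<std::vector<T> >& cells) {
    std::size_t outer = 0;
    if (!in.read(reinterpret_cast<char*>(&outer), sizeof(outer)))
        return false;
    cells.assign(outer, std::vector<T>());
    for (std::size_t i = 0; i < outer; ++i) {
        std::size_t inner = 0;
        if (!in.read(reinterpret_cast<char*>(&inner), sizeof(inner)))
            return false;
        cells[i].resize(inner);
        if (inner != 0 &&
            !in.read(reinterpret_cast<char*>(cells[i].data()),
                     static_cast<std::streamsize>(inner * sizeof(T))))
            return false;
    }
    return true;
}
```

Both functions take generic streams, so they work with a std::ofstream or std::ifstream opened with std::ios::binary as well as with a std::stringstream. If all inner vectors have the same length, the per-vector size field could be dropped and stored once in the header.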
Update according to the question update.
There are a few things you should keep in mind before inheriting from STL containers. They're not really good candidates for inheritance because they don't have virtual destructors. If you're using basic data types and POD-like structures it won't cause many issues. But if you use it in a truly object-oriented way, you may face some unpleasant behavior.
Regarding your code
Why are you typecasting it to char*?
The way you serialize the object is your choice. IMO what you did is a basic file write operation in the name of serialization.
Serialization comes down to the object, i.e. the parameter T of your template class. If T is a POD or a basic type, no special handling is needed. Otherwise you have to choose carefully how to write the object.
Choosing between a text format and a binary format is up to you. A text format always has a cost, but it is easier to inspect and manipulate than a binary format.
For example the following code is for simple read and write operation( in text format).
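// arr and arrout are arrays declared elsewhere; _countof is MSVC-specific.
fstream fr("test.txt", ios_base::out | ios_base::binary );
for( int i =0;i <_countof(arr);i++)
fr << arr[i] << ' ';
fr.close();
fstream fw("test.txt", ios_base::in| ios_base::binary);
int j = 0;
// Stop when arrout is full or when extraction fails (end of file or bad input).
while( j < _countof(arrout) && fw >> arrout[j] )
{
j++;
}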
It seems to me that the most direct route to generating a binary file containing a vector is to memory-map the file and place the vector in the mapped region. As pointed out by sarat, you need to worry about how pointers are used within the class. The Boost.Interprocess library has a tutorial on how to do this using its shared memory regions, which include memory-mapped files.
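To make the memory-mapping idea concrete without depending on Boost, here is a minimal sketch that uses the POSIX calls open, ftruncate, mmap, msync and munmap directly. This is a swapped-in technique, not the Boost.Interprocess approach from the tutorial, and it assumes a POSIX system. The function names write_mapped and read_mapped are hypothetical. Only the raw elements of a std::vector<int> are placed in the mapped region, so no pointers end up in the file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Write the elements of values into path through a memory mapping.
// Returns false on any failure (or if values is empty, since a
// zero-length mapping is not allowed).
bool write_mapped(const char* path, const std::vector<int>& values) {
    const std::size_t bytes = values.size() * sizeof(int);
    if (bytes == 0) return false;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return false;
    // The file must already have its final size before it is mapped.
    if (ftruncate(fd, static_cast<off_t>(bytes)) == -1) {
        close(fd);
        return false;
    }
    void* region = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (region == MAP_FAILED) return false;
    std::memcpy(region, values.data(), bytes);
    const bool ok = msync(region, bytes, MS_SYNC) == 0;  // flush to the file
    munmap(region, bytes);
    return ok;
}

// Map path read-only and copy its contents into values.
bool read_mapped(const char* path, std::vector<int>& values) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return false;
    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);
        return false;
    }
    const std::size_t bytes = static_cast<std::size_t>(st.st_size);
    void* region = mmap(NULL, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (region == MAP_FAILED) return false;
    values.resize(bytes / sizeof(int));
    std::memcpy(values.data(), region, values.size() * sizeof(int));
    munmap(region, bytes);
    return true;
}
```

read_mapped copies the data out only to keep the sketch simple; a program that just needs to look at the data could keep the mapping open and index into the region directly.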
First off, have you looked at Boost.multi_array? Always good to take something ready-made rather than reinventing the wheel.
That said, I'm not sure if this is helpful, but here's how I would implement the basic data structure, and it'd be fairly easy to serialize:
#include <array>
template <typename T, size_t DIM1, size_t DIM2, size_t DIM3>
class ThreeDArray
{
typedef std::array<T, DIM1 * DIM2 * DIM3> array_t;
array_t m_data;
public:
inline size_t size() const { return m_data.size(); }
inline size_t byte_size() const { return sizeof(T) * m_data.size(); }
inline T & operator()(size_t i, size_t j, size_t k)
{
return m_data[i + j * DIM1 + k * DIM1 * DIM2];
}
inline const T & operator()(size_t i, size_t j, size_t k) const
{
return m_data[i + j * DIM1 + k * DIM1 * DIM2];
}
inline const T * data() const { return m_data.data(); }
};
You can serialize the data buffer directly:
ThreeDArray<int, 4, 6, 11> arr;
/* ... */
std::ofstream outfile("file.bin", std::ios::binary);
outfile.write(reinterpret_cast<const char*>(arr.data()), arr.byte_size());