I am a relative beginner to C++. I am working on a model related to forecasting property financials, and I am having a few issues getting my data structures set up.
A bit of background: the specific task I am trying to do is set up class variables for key data structures, one such structure being called PropFinance. This structure will house all of my key information on a given property (with iterations for each property in a collection of them), including forecasts of future performance. Two of the main arguments passed to the program (applicable to all properties to be evaluated) are:
(1) number of iterations (Iterations) - how many times we are going to generate a forecast (random iterations)
(2) length of forecast (NumofPeriods) - how many periods we are going to forecast
The PropFinance class has 79 variables in it containing property details. A simple example: Expenses. For expenses, and many of my variables like it, I need an array of doubles sized at runtime - one dimension for each iteration and one for each forecasted period (the collection of properties effectively adds a third dimension). So ideally, I would have a variable for Expenses of:
class PropFinance {
    double Expenses[Iterations][NumofPeriods];
};
but I don't know Iterations and NumofPeriods at compile time. I do know the values of these two variables at the outset of runtime (and they will be constant for all iterations/properties of the current program execution).
My question is: how can I have the size of these arrays set dynamically when the program runs? Based on my research on this site and others, it seems like the two main ways to accomplish this are:
(1) Use std::vector
(2) Use a pointer in the class definition and then use new and delete to manage the memory
But even with those two options, I am not sure they will work with a third dimension (all of the examples I saw needed just a single dimension to be dynamically sized). Could someone post either a verbal explanation or (better) a simple code example showing how this would work for (1) or (2) above? Any guidance on which option is preferable would be appreciated (but I don't want to start a "what's better" debate). It seems like vector is more appropriate when the size of the array is going to be constantly changing, which is not the case here...
The overall speed of this model is critical, and as we expand the number of iterations and properties things get large quickly - so I want to do things as efficiently as possible.
Sorry I didn't post code - I can try to put something together if people are unable to discern what I am asking from above.
The idiomatic solution is to avoid direct heap allocations of C arrays and to prefer an STL container like std::vector, which automatically handles resizing, iteration, and element access in an efficient, portable manner. I would highly recommend Scott Meyers' Effective STL, which discusses the appropriateness of each container for different applications (insertion/removal/retrieval complexity guarantees, etc.).
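For the sizes-known-at-startup case described above, a minimal sketch of a runtime-sized 3D structure (the property/iteration/period counts below are purely illustrative):

#include <cstddef>
#include <vector>

int main()
{
    // Known at the outset of runtime; hypothetical figures.
    const std::size_t properties = 50, iterations = 1000, numOfPeriods = 120;

    // [property][iteration][period], fully sized up front so no
    // reallocation happens inside the forecasting loops.
    std::vector<std::vector<std::vector<double> > > expenses(
        properties,
        std::vector<std::vector<double> >(
            iterations,
            std::vector<double>(numOfPeriods, 0.0)));

    expenses[0][0][0] = 1234.56; // indexes just like a built-in array
    return 0;
}

Since the sizes never change after construction, the vectors never reallocate; access goes through one pointer indirection per level, which is usually acceptable.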
If you need more than two dimensions (3, 4, 5, and so on), the easiest solution I know of is the multi_array provided by Boost.
If you only need a two-dimensional array, use a vector of vectors:
std::vector<std::vector<double> > Expenses;
Since you are a beginner, you had better start with the high-level components provided by C++; even when you are familiar with C++, you should stay with those high-level components. The basic elements of C++ are for when you need to develop infrastructure yourself (a vector, a list, smart pointers, threads, and so on).
#include <iostream>
#include <vector>

int main()
{
    std::vector<std::vector<double> > expenses(10); // contains 10 std::vector<double>
    expenses[0].push_back(100);
    std::cout << expenses[0][0] << std::endl;
    expenses.push_back(std::vector<double>()); // now expenses holds 11 std::vector<double>
    return 0;
}
Further reading: how to use vector, and Boost multi_array.
I think you are approaching object-oriented programming wrong.
Instead of having a master class PropFinance with everything in many-dimensional arrays, have you considered having classes like Iteration, which holds multiple Periods, such as:
class Period
{
public:
    double Expense;
};

class Iteration
{
public:
    std::vector<Period> _periods;
};
Then as you add more dimensions you can create enclosing classes, with PropFinance at the top:
class PropFinance
{
public:
    std::vector<Iteration> _iterations;
};
This makes everything more manageable instead of having deeply nested arrays [][][][]. As a rule of thumb, whenever you have multidimensional arrays, consider creating subclasses containing the inner dimensions.
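A minimal usage sketch (the iteration and period counts are hypothetical; the members are public here for brevity):

#include <cstddef>
#include <vector>

class Period      { public: double Expense; };
class Iteration   { public: std::vector<Period> _periods; };
class PropFinance { public: std::vector<Iteration> _iterations; };

int main()
{
    const std::size_t iterations = 1000, numOfPeriods = 120; // hypothetical

    PropFinance prop;
    prop._iterations.assign(iterations, Iteration());
    for (std::size_t i = 0; i < iterations; ++i)
        prop._iterations[i]._periods.assign(numOfPeriods, Period());

    prop._iterations[0]._periods[0].Expense = 500.0; // like Expenses[0][0]
    return 0;
}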
Consider the following data structure:
class Column {
    std::vector<Cell> cells; // in my case, usually 1 to 10 elements
    Header header;           // some additional information
};

std::vector<Column> table;
// important: we know that all Columns have the same number of cells.
// you could also think of it as:
std::vector<std::pair<Header, std::vector<Cell>>> table_with_pairs;
Of course, this is fairly inefficient in terms of memory usage, and for certain operations (I am thinking of random access, or iterating over smaller parts of the table). Is there a ready-made container (probably non-STL) that could handle this “2D table with headers” better? Or a common practice? The goal would be a high-performance, easy-to-maintain and readable solution.
This problem seems related to the well-known “2D array” task but we have the additional header information. One solution I can think of is to store the headers in a std::vector parallel to a 2D array (pick your favourite solution) but that’s not easy to maintain, might have bad locality, and is against the principles of OOP. I’d prefer a memory layout similar to a std::vector<std::pair<Header, std::array<Cell,n>>> with n not known at compile time but constant. Determining the address of a cell (or header) would be nearly as trivial as in a 2D array but I thought I’d ask before coding my own solution.
[Edit] To clarify, I am not thinking about things like Excel sheets here. I could have called it “vector of objects that contain a vector and other data”.
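For concreteness, a rough sketch of the kind of wrapper I imagine, with the parallel storage hidden behind one class (all names illustrative):

#include <cstddef>
#include <vector>

struct Header { int info; };     // stand-ins for the real Header/Cell
struct Cell   { double value; };

class Table
{
    std::vector<Header> headers_;
    std::vector<Cell>   cells_;   // one block; column c occupies [c*n_, (c+1)*n_)
    std::size_t         n_;       // cells per column, constant after construction
public:
    Table(std::size_t columns, std::size_t cellsPerColumn)
        : headers_(columns),
          cells_(columns * cellsPerColumn),
          n_(cellsPerColumn) {}

    Header& header(std::size_t c)              { return headers_[c]; }
    Cell&   cell(std::size_t c, std::size_t i) { return cells_[c * n_ + i]; }
};

Addressing a cell is the same multiply-and-add as in a 2D array; whether the header/cell split hurts locality in practice is exactly what I'm unsure about.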
I am asking this question mostly for confirmation, because I am not an expert in data structures, but I think the structure that suits my need is a hashmap.
Here is my problem (which I guess is typical?):
We are looking at pairwise interactions between a large number of objects (say N=90k), so think about the storage as a sparse matrix;
There is a process, say (P), which randomly starts from one object, and computes a model which may lead to another object: I cannot predict the pairs in advance, so I need to be able to "create" entries dynamically (arguably the performance is not very critical here);
The process (P) may "revisit" existing pairs and update the corresponding element in the matrix: this happens a lot, and therefore I need to be able to find and update as fast as possible.
Finally, the process (P) is repeated millions of times, but it only requires write access to the data structure; it does not need to know about the latest "state" of that storage. This feels intuitively like a detail that might be exploited to improve performance, but I don't think hashmaps do that.
This last point is actually the main reason for my question here: is there a data-structure which satisfies the first three points (I'm thinking hash-map, correct?), and which would also exploit the last feature for improved performance (I'm thinking something like buffering operations and execute them in bulk asynchronously)?
EDIT: I am working with C++, and would prefer it if there was an existing library implementing that data structure. In addition, I am limited by the system requirements; I cannot use C++11 features.
I would use something like:
#include <utility>
#include <boost/unordered_map.hpp>

class Data
{
    boost::unordered_map<std::pair<int,int>, double> map;
public:
    void update(int i, int j, double v)
    {
        map[std::pair<int,int>(i, j)] += v;
    }
    void output(); // Prints data somewhere.
};
That will get you going (you may need to declare a suitable hash function). You might be able to speed things up by making the key type be a 64-bit integer, and using ((int64_t)i << 32) | j to make the index.
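For illustration, the 64-bit-key variant might look like this sketch (makeKey and Data64 are illustrative names; casting j through uint32_t first avoids sign-extension clobbering the upper half when j is negative):

#include <boost/cstdint.hpp>
#include <boost/unordered_map.hpp>

inline boost::uint64_t makeKey(int i, int j)
{
    return (static_cast<boost::uint64_t>(static_cast<boost::uint32_t>(i)) << 32)
         | static_cast<boost::uint32_t>(j);
}

class Data64
{
    boost::unordered_map<boost::uint64_t, double> map; // boost::hash handles integers
public:
    void update(int i, int j, double v)
    {
        map[makeKey(i, j)] += v;
    }
};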
If nearly all the updates go to a small fraction of the pairs, you could have two maps (small and large), and directly update the small map. Every time the size of small passes a threshold, you could fold it into large and clear small. You would need to do some careful testing to see if this helps or not. The only reason I think it might help is by improving cache locality.
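A sketch of that two-map idea (the class name and threshold are illustrative; as said, only testing will tell whether it actually helps):

#include <cstddef>
#include <utility>
#include <boost/functional/hash.hpp>
#include <boost/unordered_map.hpp>

class TwoLevelData
{
    typedef boost::unordered_map<std::pair<int, int>, double> Map;
    Map small_, large_;
    std::size_t threshold_;
public:
    explicit TwoLevelData(std::size_t threshold) : threshold_(threshold) {}

    void update(int i, int j, double v)
    {
        small_[std::make_pair(i, j)] += v;
        if (small_.size() >= threshold_)
            flush(); // fold the hot map into the big one
    }

    void flush()
    {
        for (Map::const_iterator it = small_.begin(); it != small_.end(); ++it)
            large_[it->first] += it->second;
        small_.clear();
    }
};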
Even if you end up using a different data structure, you can keep this class interface, and the rest of the code will be undisturbed. In particular, dropping sparsehash into the same structure will be very easy.
C++ newbie here! I would like to simulate a population containing patches, containing individuals, containing chromosomes, containing genes.
What are the pros and cons of using a series of simple classes versus a highly dimensional matrix in C++? Typically, does the time to access a memory slot vary between the two techniques?
Highly dimensional Matrix
One could make "a vector of vectors of vectors of vectors" (or a C-style highly dimensional array of integers) and access any gene in memory with:
for (int patch = 0; patch < POP.size(); patch++)
{
    for (int ind = 0; ind < POP[patch].size(); ind++)
    {
        for (int chrom = 0; chrom < POP[patch][ind].size(); chrom++)
        {
            for (int gene = 0; gene < POP[patch][ind][chrom].size(); gene++)
            {
                POP[patch][ind][chrom][gene];
            }
        }
    }
}
Series of Simple Classes
One could use a series of simple classes and access any gene in memory with
for (int patch = 0; patch < POP->PATCHES.size(); patch++)
{
    for (int ind = 0; ind < POP->PATCHES[patch]->INDIVIDUALS.size(); ind++)
    {
        for (int chrom = 0; chrom < POP->PATCHES[patch]->INDIVIDUALS[ind]->CHROMOSOMES.size(); chrom++)
        {
            for (int gene = 0; gene < POP->PATCHES[patch]->INDIVIDUALS[ind]->CHROMOSOMES[chrom]->GENES.size(); gene++)
            {
                POP->PATCHES[patch]->INDIVIDUALS[ind]->CHROMOSOMES[chrom]->GENES[gene];
            }
        }
    }
}
While a high-dimensional matrix would work, consider that you might want to add more information to an individual. It might not just have chromosomes, but also an age, siblings, parents, phenotypes, et cetera. It is then natural to have a class Individual, which can contain all that information along with its list of chromosomes. Using classes will group relevant information together.
While I generally agree with @g-sliepen's answer, there is an additional point you should know about:
C++ gives you the ability to make a distinction between interface and type. You can leave the type abstract for the users of your code (even if that is only you) and provide a finite set of operations on it.
Using this pattern allows you to change the implementation completely (e.g. back to vectors for parallel computation etc.) later without having to change the code using it (e.g. a concrete simulation).
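For instance, such an interface might look like this sketch (all names are illustrative):

#include <cstddef>

// Abstract view of the population: simulation code programs against this
// and never sees the storage layout underneath.
class PopulationView
{
public:
    virtual ~PopulationView() {}
    virtual std::size_t numPatches() const = 0;
    virtual std::size_t numIndividuals(std::size_t patch) const = 0;
    virtual int gene(std::size_t patch, std::size_t ind,
                     std::size_t chrom, std::size_t g) const = 0;
};

A nested-class implementation, a flat-array implementation, or a parallelized one can all derive from it and be swapped without touching the simulation code.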
I won't cover what's already been suggested (it is generally a good idea to store your individual entities as a class with all relevant fields associated with them); I'll just address your first suggestion:
The issue with using something like a std::vector<std::vector<std::vector<std::vector<type>>>> (apart from the fact that it's a pain to handle generically) is that while the overall std::vector enclosing the structure has contiguous storage (so long as you aren't storing bools in your std::vector, that is), the inner vectors are not contiguous with each other or with the other elements.
Due to this, if you are storing a large amount of data in your structure and need access and iteration to be as fast as possible, this method of storage is not ideal - it also complicates matters of iterating through the entire structure.
A good solution for storing a large multi-dimensional "matrix" (technically a rank 4 tensor in this case I suppose) when you require fast iteration and random access is to write a wrapper around a single std::vector in some row-major / column-major configuration such that all your data is stored as a contiguous block and you can iterate over it all via a single loop or call to std::for_each (for example). Then each index by which you access the structure would correspond to patch, ind, chrom and gene in order.
An example of a pre-built data structure which could handle this is boost::multi_array if you'd rather not code the wrapper yourself.
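If you do write the wrapper yourself, a minimal sketch could look like this (names and the int gene type are illustrative; it assumes every patch has the same number of individuals, every individual the same number of chromosomes, and so on):

#include <cstddef>
#include <vector>

class GeneTensor
{
    std::vector<int> data_;              // one contiguous block
    std::size_t nInd_, nChrom_, nGene_;  // extents of the inner dimensions
public:
    GeneTensor(std::size_t patches, std::size_t inds,
               std::size_t chroms, std::size_t genes)
        : data_(patches * inds * chroms * genes),
          nInd_(inds), nChrom_(chroms), nGene_(genes) {}

    // Row-major index: patch varies slowest, gene fastest.
    int& operator()(std::size_t patch, std::size_t ind,
                    std::size_t chrom, std::size_t gene)
    {
        return data_[((patch * nInd_ + ind) * nChrom_ + chrom) * nGene_ + gene];
    }
};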
There are two major ways to do multidimensional arrays: a vector of vectors (aka a jagged array) and a truly multidimensional array, an n-dimensional cube. Using the latter means, for example, that all chromosomes have the same number of genes and every individual has the same number of chromosomes. If you can accept those restrictions, you get some advantages, like contiguous memory storage.
Pros, I need some performance opinions on the following:
1st Question:
I want to store objects in a 3D-Grid-Structure, overall it will be ~33% filled, i.e. 2 out of 3 gridpoints will be empty.
Maybe Option A)
vector<vector<vector<deque<Obj> > > > grid; // (SizeX, SizeY, SizeZ)
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The other option, B), would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add & delete deques as data is added/deleted. Probably less memory used, but maybe slower? What do you think?
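For reference, the key type and functors I have in mind look roughly like this (the hash mixing is a placeholder, not tuned):

#include <cstddef>
#include <deque>
#include <unordered_map>

struct Obj {}; // stand-in

struct Pos3D { int x, y, z; };

struct Pos3DHash {
    std::size_t operator()(const Pos3D& p) const {
        std::size_t h = static_cast<std::size_t>(p.x);
        h = h * 1000003u ^ static_cast<std::size_t>(p.y);
        h = h * 1000003u ^ static_cast<std::size_t>(p.z);
        return h;
    }
};

struct Pos3DEqual {
    bool operator()(const Pos3D& a, const Pos3D& b) const {
        return a.x == b.x && a.y == b.y && a.z == b.z;
    }
};

std::unordered_map<Pos3D, std::deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;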
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different entity types (ObjA, ObjB, ObjC) per grid point; then my data essentially becomes 4D?
Using Option 1B I could just extend Pos3D to include the bucket number to account for even more sparse data.
Possible queries I want to optimize for:
Give me all Objects out of ObjA-buckets from the entire structure
Give me all Objects out of ObjB-buckets for a set of grid-positions
Which is the nearest non-empty ObjC-bucket to position x,y,z?
PS:
I had also thought about a tree-based data structure before, from reading about nearest-neighbour approaches. Since my data is so regular, I thought I'd save myself all the tree-building subdivision of the cells into smaller pieces and just make a static 3D grid of the final leaves. That's how I came to ask about the best way to store this grid here.
A question associated with this: if I have a map<int, Obj>, is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way to build the above-mentioned tree?
EDIT
I ended up going with a 3D boost::multi_array with Fortran ordering. It's a little bit like the chunks that games like Minecraft use, which in turn is a little like a kd-tree with a fixed leaf size and a fixed number of leaves. It works pretty fast now, so I'm happy with this approach.
Answer to 1st question
As @Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector (or an array, if you will). std::vector brings easier maintenance at minimal overhead, so I'd prefer that. In terms of space it consumes O(N^3), where N is the number of grid points along one dimension. To get the best performance when iterating over the data, loop over the indices in the order you defined them, so the innermost (last) index varies fastest.
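To illustrate A), allocating the whole grid up front and looping with the last index innermost might look like this (sizes illustrative):

#include <cstddef>
#include <deque>
#include <vector>

struct Obj {}; // stand-in for the stored object type

int main()
{
    const std::size_t sizeX = 64, sizeY = 64, sizeZ = 64; // illustrative

    std::vector<std::vector<std::vector<std::deque<Obj> > > > grid(
        sizeX, std::vector<std::vector<std::deque<Obj> > >(
                   sizeY, std::vector<std::deque<Obj> >(sizeZ)));

    // z varies fastest, matching the declaration order x, y, z.
    for (std::size_t x = 0; x < sizeX; ++x)
        for (std::size_t y = 0; y < sizeY; ++y)
            for (std::size_t z = 0; z < sizeZ; ++z)
                grid[x][y][z].push_back(Obj());
    return 0;
}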
B) If you instead wish to keep things as small as possible, use a hash map, and use one which is optimized for space. That would result in O(N) space, with N being the number of elements. Here is a benchmark comparing several hash maps. I have had good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add to your build system.
If you need a mixture of speed and small data or don't know the size of each dimension in advance, use a hash map as well.
Answer to 2nd question
I'd say your data is 4D if you have a variable number of elements along the 4th dimension, or a fixed large number of them. With option 1B) you'd indeed add the bucket index; for 1A) you'd add another vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest neighbor search. You want a KDTree for that. There is libkdtree++, if you prefer small libraries. Otherwise, FLANN might be an option. It is a part of the Point Cloud Library which accomplishes a lot of tasks on multidimensional data and could be worth a look as well.
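As for the side question about map<int, Obj>: std::map is already an ordered tree, so lower_bound/upper_bound answer "all objects with keys between 780 and 790" directly, with no extra structure needed. A minimal sketch (Obj is a stand-in):

#include <iostream>
#include <map>

struct Obj { int payload; };

int main()
{
    std::map<int, Obj> m;
    m[779].payload = 1; m[783].payload = 2; m[790].payload = 3;

    // First key >= 780 and first key > 790: O(log n) each, then the
    // walk is linear in the number of hits (783 and 790 here).
    std::map<int, Obj>::iterator it  = m.lower_bound(780);
    std::map<int, Obj>::iterator end = m.upper_bound(790);
    for (; it != end; ++it)
        std::cout << it->first << " -> " << it->second.payload << "\n";
    return 0;
}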
Now I am writing some code for solving vehicle routing problems. To do so, one important decision is to choose how to encode the solutions. A solution contains several routes, one for each vehicle. Each route has a customer visiting sequence, the load of route, the length of route.
To perform modifications on a solution, I also need to quickly find some information.
For example,
Which route is a customer in?
What customers does a route have?
How many nodes are there in a route?
What nodes are in front of or behind a node?
Now, I am thinking of using the following structure to keep a solution:
struct Sol
{
    vector<short> nextNode;  // the node that follows each node
    vector<short> preNode;   // the node that precedes each node
    vector<short> startNode; // the first node of each route
    vector<short> rutNum;    // the route each node belongs to
    vector<short> rutLoad;   // the load of each route
    vector<float> rutLength; // the length of each route
    vector<short> rutSize;   // the number of nodes in each route
};
The common size of each vector is instance dependent, between 200-2000.
I have heard it is possible to use a dynamic array to do this job, but that seems more complicated to me: one has to allocate the memory and release it. My question here is twofold.
How do I use a dynamic array for the same purpose? How should I define the struct or class so that memory allocation and release are easily taken care of?
Will using a dynamic array be faster than using a vector, assuming the solution structure needs to be accessed millions of times?
It is highly unlikely that you'll see an appreciable performance difference between a dynamic array and a vector since the latter is essentially a very thin wrapper around the former. Also bear in mind that using a vector would be significantly less error-prone.
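To make that concrete, a minimal sketch of both approaches for one of the vectors above (the size is illustrative):

#include <vector>

int main()
{
    const int n = 2000; // instance-dependent size, known only at runtime

    // Raw dynamic array: you allocate and must remember to release.
    short* nextNodeRaw = new short[n];
    nextNodeRaw[0] = 1;
    delete[] nextNodeRaw; // forget this and you leak

    // std::vector: same contiguous storage, cleanup is automatic.
    std::vector<short> nextNode(n);
    nextNode[0] = 1;
    return 0;
}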
It may, however, be the case that some information is better stored in a different type of container altogether, e.g. in an std::map. The following might be of interest: What are the complexity guarantees of the standard containers?
It is important to give some thought to the type of container that gets used. However, when it comes to micro-optimizations (such as vector vs dynamic array), the best policy is to profile the code first and only focus on parts of the code that prove to be real -- rather than assumed -- bottlenecks.
It's quite possible that vector's code is actually better and more performant than dynamic array code you would write yourself. Only if profiling shows significant time spent in vector would I consider writing my own error-prone replacement. See also Dynamically allocated arrays or std::vector
I'm using MSVC and the implementation looks to be as quick as it can be.
Accessing the array via operator [] is:
return (*(this->_Myfirst + _Pos));
Which is as quick as you are going to get with dynamic memory.
The only overhead you are going to get is in the memory use of a vector: it seems to keep a pointer to the start of the vector, the end of the vector, and the end of the current sequence. That is only 2 more pointers than you would need if you were using a dynamic array, and you are only creating 200-2000 of these; I doubt memory is going to be that tight.
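You can check that per-object overhead yourself; the exact numbers are implementation-dependent:

#include <iostream>
#include <vector>

int main()
{
    // Typical implementations hold begin / end / capacity-end pointers,
    // so the vector object itself is only a few words; the heap buffer
    // is extra, exactly as it would be for a raw dynamic array.
    std::cout << "sizeof(std::vector<short>) = " << sizeof(std::vector<short>)
              << ", sizeof(short*) = " << sizeof(short*) << '\n';
    return 0;
}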
I am sure the other STL implementations are very similar. I would absorb the minor cost of vector storage and use vectors in your project.