“2D table with headers” type container - C++

Consider the following data structure:
class Column {
    std::vector<Cell> cells; // in my case, usually 1 to 10 elements
    Header header;           // some additional information
};
std::vector<Column> table;
// important: we know that all Columns have the same number of cells.
// you could also think of it as:
std::vector<std::pair<Header, std::vector<Cell>>> table_with_pairs;
Of course, this is fairly inefficient in terms of memory usage and for certain operations (I am thinking of random access, or iterating over smaller parts of the table). Is there a ready-made container (probably non-STL) that could handle this “2D table with headers” better? Or a common practice? The goal would be a high-performance, easy-to-maintain and readable solution.
This problem seems related to the well-known “2D array” task but we have the additional header information. One solution I can think of is to store the headers in a std::vector parallel to a 2D array (pick your favourite solution) but that’s not easy to maintain, might have bad locality, and is against the principles of OOP. I’d prefer a memory layout similar to a std::vector<std::pair<Header, std::array<Cell,n>>> with n not known at compile time but constant. Determining the address of a cell (or header) would be nearly as trivial as in a 2D array but I thought I’d ask before coding my own solution.
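For illustration, the parallel-storage idea wrapped in a single class might look roughly like this (a sketch only; Header and Cell stand in for the real types and are assumed default-constructible, and the number of cells per column is fixed at construction):

#include <cstddef>
#include <vector>

class Table {
    std::size_t rows_;            // cells per column, constant after construction
    std::vector<Header> headers_; // one header per column
    std::vector<Cell> cells_;     // all cells in one flat, contiguous buffer
public:
    Table(std::size_t columns, std::size_t rows)
        : rows_(rows), headers_(columns), cells_(columns * rows) {}

    Header& header(std::size_t col) { return headers_[col]; }
    Cell& cell(std::size_t col, std::size_t row) { return cells_[col * rows_ + row]; }
};

Cell addressing is then as trivial as in a plain 2D array, and the headers stay in their own contiguous vector.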
[Edit] To clarify, I am not thinking about things like Excel sheets here. I could have called it “vector of objects that contain a vector and other data”.

Related

Data structure for sparse insertion

I am asking this question mostly for confirmation, because I am not an expert in data structures, but I think the structure that suits my need is a hashmap.
Here is my problem (which I guess is typical?):
We are looking at pairwise interactions between a large number of objects (say N=90k), so think about the storage as a sparse matrix;
There is a process, say (P), which randomly starts from one object, and computes a model which may lead to another object: I cannot predict the pairs in advance, so I need to be able to "create" entries dynamically (arguably the performance is not very critical here);
The process (P) may "revisit" existing pairs and update the corresponding element in the matrix: this happens a lot, and therefore I need to be able to find and update as fast as possible.
Finally, the process (P) is repeated millions of times, but it only requires write access to the data structure; it does not need to know about the latest "state" of that storage. This feels intuitively like a detail that might be exploited to improve performance, but I don't think hashmaps exploit it.
This last point is actually the main reason for my question here: is there a data structure which satisfies the first three points (I'm thinking a hash map, correct?), and which would also exploit the last feature for improved performance (I'm thinking something like buffering operations and executing them in bulk asynchronously)?
EDIT: I am working with C++, and would prefer it if there was an existing library implementing that data structure. In addition, I am limited by the system requirements; I cannot use C++11 features.
I would use something like:
#include <boost/unordered_map.hpp>

class Data
{
    boost::unordered_map<std::pair<int, int>, double> map;
public:
    void update(int i, int j, double v)
    {
        map[std::pair<int, int>(i, j)] += v;
    }
    void output(); // Prints data somewhere.
};
That will get you going (you may need to declare a suitable hash function). You might be able to speed things up by making the key type be a 64-bit integer, and using ((int64_t)i << 32) | j to make the index.
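For example, the 64-bit key variant might look like this (a sketch only; note the cast of j through an unsigned 32-bit type so the OR does not sign-extend, and boost/cstdint.hpp is used to stay clear of C++11):

#include <boost/cstdint.hpp>
#include <boost/unordered_map.hpp>

typedef boost::unordered_map<boost::int64_t, double> pair_map;

inline boost::int64_t make_key(int i, int j)
{
    // Put i in the high 32 bits and j in the low 32 bits.
    return (static_cast<boost::int64_t>(i) << 32)
         | static_cast<boost::uint32_t>(j);
}

// usage: map[make_key(i, j)] += v;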
If nearly all the updates go to a small fraction of the pairs, you could have two maps (small and large) and directly update the small map. Every time the size of small passes a threshold, you could merge it into large and clear small. You would need to do some careful testing to see if this helps or not. The only reason I think it might help is by improving cache locality.
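A rough sketch of that two-map idea (the threshold is an arbitrary placeholder that would need tuning, and the default boost::hash is assumed to handle the pair key):

#include <utility>
#include <boost/unordered_map.hpp>

class BufferedData
{
    boost::unordered_map<std::pair<int, int>, double> small; // hot, recently touched pairs
    boost::unordered_map<std::pair<int, int>, double> large; // everything flushed so far
    static const std::size_t threshold = 1024;               // arbitrary, tune by measurement

    void flush()
    {
        boost::unordered_map<std::pair<int, int>, double>::const_iterator it;
        for (it = small.begin(); it != small.end(); ++it)
            large[it->first] += it->second;
        small.clear();
    }
public:
    void update(int i, int j, double v)
    {
        small[std::make_pair(i, j)] += v;
        if (small.size() > threshold)
            flush();
    }
};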
Even if you end up using a different data structure, you can keep this class interface, and the rest of the code will be undisturbed. In particular, dropping sparsehash into the same structure will be very easy.

Multi-dimensional dynamic arrays in classes in C++

I am a relative beginner to C++. I am working on a model related to forecasting property financials, and I am having a few issues getting my data structures setup.
A bit of background - the specific task I am trying to do is set up class variables for key data structures - one such structure is called PropFinance. This structure will house all of my key information on a given property (with iterations for each property in a collection of them), including forecasts of future performance. Two of the main arguments being passed to the program are (applicable to all properties to be evaluated):
(1) number of iterations (Iterations) - how many times we are going to generate a forecast (random iterations)
(2) length of forecast (NumofPeriods) - how many periods we are going to forecast
The PropFinance class has 79 variables in it containing property details. A simple example - Expenses. For expenses, and many of my variables like it, I will need to create a 2D array of doubles - one dimension for the iterations and one dimension for the forecasted periods. So ideally, I would have a variable for Expenses of:
class PropFinance {
    double Expenses[Iterations][NumofPeriods];
};
but, I don't know Iterations and NumofPeriods at compile time. I do know the value of these two variables at the outset of runtime (and they will be constant for all iterations/properties of the current program execution)
My question is how can I have the size of these arrays dynamically updated when the program runs? Based on my research on this site and others, it seems like the two main ways to accomplish this are
(1) Use a container like std::vector
(2) Use a pointer in the class definition and then use new and delete to manage the memory
But even with those two options, I am not sure if it will work with a third dimension (all of the examples I saw needed just a single dimension to be dynamically sized). Could someone post either a verbal explanation or (better) a simple code example showing how this would work in (1) or (2) above? Any guidance on which option is preferable would be appreciated (but I don't want to start a "what's better" debate). It seems like vector is more appropriate when the size of the array is going to be constantly changing, which is not the case here...
The overall speed of this model is critical, and as we expand the number of iterations and properties things get large quickly - so I want to do things as efficiently as possible.
Sorry I didn't post code - I can try to put something together if people are unable to discern what I am asking from above.
The idiomatic solution is to avoid direct heap allocations of C arrays and to prefer an STL container like std::vector, which automatically handles resizing, iteration, and element access in an efficient, portable manner. I would highly recommend Scott Meyers' Effective STL, which talks about the appropriateness of each container for different applications - insertion/removal/retrieval complexity guarantees, etc.
If you need more than two dimensions (3, 4, 5, and so on), the easiest solution I know of is the multi_array provided by Boost.
If you only need a two-dimensional array, use a vector of vectors:
std::vector<std::vector<double> > Expenses;
Since you are a beginner, you had better start with the higher-level components provided by C++; even when you are familiar with C++, you should stay with those high-level components too. The basic elements of C++ are for when you need to develop some infrastructure yourself (a vector, list, smart pointer, thread, and so on).
#include <iostream>
#include <vector>

int main()
{
    std::vector<std::vector<double> > expenses(10); // contains 10 std::vector<double>
    expenses[0].push_back(100);
    std::cout << expenses[0][0] << std::endl;
    expenses.push_back(std::vector<double>()); // now expenses has 11 std::vector<double>
    return 0;
}
how to use vector
multi array
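Applied to the question, the member could be a vector of vectors sized in the constructor once Iterations and NumofPeriods are known (a sketch only; the member and parameter names simply mirror the question):

#include <vector>

class PropFinance {
    std::vector<std::vector<double> > Expenses; // [iteration][period]
public:
    PropFinance(int iterations, int numOfPeriods)
        : Expenses(iterations, std::vector<double>(numOfPeriods, 0.0))
    {
    }
};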
I think you are approaching object oriented programming wrong.
Instead of having a master class PropFinance with everything in multi-dimensional arrays, have you considered having classes like Iteration, which holds multiple Period objects, such as:
class Period
{
public:
    double Expense;
};

class Iteration
{
    std::vector<Period> _periods; // Period is defined first so the vector member sees a complete type
};
Then, as you add more dimensions, you can create enclosing classes such as PropFinance:
class PropFinance
{
    std::vector<Iteration> _iterations;
};
This makes everything more manageable instead of having deeply nested arrays [][][][]. As a rule of thumb, whenever you have multi-dimensional arrays, consider creating subclasses containing the other dimensions.

C++ efficiently extracting subsets from vector of user defined structure

Let me preface this with the statement that most of my background has been with functional programming languages so I'm fairly novice with C++.
Anyhow, the problem I'm working on is parsing a CSV file with multiple variable types. Sample lines from the data look like this:
"2011-04-14 16:00:00, X, 1314.52, P, 812.1, 812"
"2011-04-14 16:01:00, X, 1316.32, P, 813.2, 813.1"
"2011-04-14 16:02:00, X, 1315.23, C, 811.2, 811.1"
So what I've done is defined a struct which stores each line. Then each of these are stored in a std::vector< mystruct >. Now say I want to subset this vector by column 4 into two vectors where every element with P in it is in one and C in the other.
Now the example I gave is fairly simplified, but the actual problem involves subsetting multiple times.
My initial naive implementation was to iterate through the entire vector, creating individual subsets defined by new vectors, then subsetting those newly created vectors. Maybe something a bit more memory-efficient would be to create an index, which would then be shrunk down.
Now my question is: is there a more efficient (in terms of speed/memory usage) way to do this within this std::vector<mystruct> framework, or is there some better data structure to handle this type of thing?
Thanks!
EDIT:
Basically the output I'd like is first two lines and last line separately. Another thing worth noting, is that typically the dataset is not ordered like the example, so the Cs and Ps are not grouped together.
I've used std::partition for this. It's part of the standard library (<algorithm>), not Boost.
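A minimal sketch of that approach, assuming a hypothetical mystruct with a char flag member holding the 'P'/'C' column (std::stable_partition keeps the original ordering within each group):

#include <algorithm>
#include <vector>

struct mystruct {
    // ... other columns ...
    char flag; // 'P' or 'C' (column 4 in the csv)
};

struct is_p {
    bool operator()(const mystruct& row) const { return row.flag == 'P'; }
};

std::vector<mystruct>::iterator split(std::vector<mystruct>& rows)
{
    // After this call all 'P' rows precede all 'C' rows; the returned iterator marks the boundary.
    return std::stable_partition(rows.begin(), rows.end(), is_p());
}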
If you want a data structure that allows you to move elements between different instances cheaply, the data structure you are looking for is std::list<> and its splice() family of functions.
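For illustration, splicing relinks list nodes without copying the elements (a sketch using the same hypothetical mystruct as above):

#include <list>

void split_list(std::list<mystruct>& rows, std::list<mystruct>& p_rows)
{
    std::list<mystruct>::iterator it = rows.begin();
    while (it != rows.end()) {
        std::list<mystruct>::iterator cur = it++;   // remember the current node, then advance
        if (cur->flag == 'P')
            p_rows.splice(p_rows.end(), rows, cur); // O(1): relinks the node, no copying
    }
}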
I understand you don't have trouble doing this per se, but you seem to be concerned about memory usage and performance.
Depending on the size of your struct and the number of entries in the CSV file, it may be advisable to use a smart pointer if you don't need to modify the partitioned data, so that the mystruct objects are not copied:
typedef std::vector<boost::shared_ptr<mystruct> > table_t;
table_t cvs_data;
If you use std::partition (as another poster suggested), you need to define a predicate that takes the indirection of the shared_ptr into account.
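Such a predicate might look like this (a sketch, again assuming a hypothetical mystruct with a char flag member):

#include <algorithm>
#include <vector>
#include <boost/shared_ptr.hpp>

struct is_p_ptr {
    bool operator()(const boost::shared_ptr<mystruct>& row) const
    {
        return row->flag == 'P'; // dereference the shared_ptr before testing the field
    }
};

// usage:
// table_t::iterator middle = std::stable_partition(cvs_data.begin(), cvs_data.end(), is_p_ptr());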

Vector versus dynamic array, does it make a big difference in speed?

Now I am writing some code for solving vehicle routing problems. To do so, one important decision is to choose how to encode the solutions. A solution contains several routes, one for each vehicle. Each route has a customer visiting sequence, the load of route, the length of route.
To perform modifications on a solution, I also need to quickly find some information.
For example,
Which route is a customer in?
What customers does a route have?
How many nodes are there in a route?
What nodes are in front of or behind a node?
Now, I am thinking to use the following structure to keep a solution.
struct Sol
{
    vector<short> nextNode;  // the next node of each node
    vector<short> preNode;   // the preceding node of each node
    vector<short> startNode; // the starting node of each route
    vector<short> rutNum;    // the route each node belongs to
    vector<short> rutLoad;   // the load of each route
    vector<float> rutLength; // the length of each route
    vector<short> rutSize;   // the number of nodes in each route
};
The common size of each vector is instance dependent, between 200-2000.
I heard it is possible to use a dynamic array to do this job, but it seems to me a dynamic array is more complicated: one has to allocate the memory and release it. My question here is twofold.
How can I use dynamic arrays to achieve the same purpose? How should I define the struct or class so that memory allocation and release can be easily taken care of?
Will using a dynamic array be faster than using a vector, assuming the solution structure needs to be accessed millions of times?
It is highly unlikely that you'll see an appreciable performance difference between a dynamic array and a vector since the latter is essentially a very thin wrapper around the former. Also bear in mind that using a vector would be significantly less error-prone.
It may, however, be the case that some information is better stored in a different type of container altogether, e.g. in an std::map. The following might be of interest: What are the complexity guarantees of the standard containers?
It is important to give some thought to the type of container that gets used. However, when it comes to micro-optimizations (such as vector vs dynamic array), the best policy is to profile the code first and only focus on parts of the code that prove to be real -- rather than assumed -- bottlenecks.
It's quite possible that vector's code is actually better and more performant than dynamic array code you would write yourself. Only if profiling shows significant time spent in vector would I consider writing my own error-prone replacement. See also Dynamically allocated arrays or std::vector
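To make the "error-prone" point concrete, here is roughly what the manual version of one field would look like next to the vector version (a sketch only; the copy constructor and assignment operator that the manual version still needs are deliberately left out, which is exactly the kind of thing that goes wrong):

#include <cstddef>
#include <vector>

struct SolManual
{
    short* nextNode;
    std::size_t size;

    explicit SolManual(std::size_t n) : nextNode(new short[n]), size(n) {}
    ~SolManual() { delete[] nextNode; }
    // still missing: copy constructor and assignment operator (Rule of Three)
};

struct SolVector
{
    std::vector<short> nextNode;

    explicit SolVector(std::size_t n) : nextNode(n) {}
    // copying, destruction, and sizing are all handled by std::vector
};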
I'm using MSVC and the implementation looks to be as quick as it can be.
Accessing the array via operator [] is:
return (*(this->_Myfirst + _Pos));
Which is as quick as you are going to get with dynamic memory.
The only overhead you are going to get is in the memory use of a vector: it seems to keep a pointer to the start of the storage, a pointer to the end of the current sequence, and a pointer to the end of the allocated storage. This is only 2 more pointers than you would need if you were using a dynamic array. You are only creating 200-2000 of these, so I doubt memory is going to be that tight.
I am sure the other STL implementations are very similar. I would absorb the minor cost of vector storage and use vectors in your project.

Best Data Structure for Genetic Algorithm in C++?

I need to implement a genetic algorithm customized for my problem (a college project), and the first version had it coded as a matrix of short (bits per chromosome x size of population).
That was a bad design, since I am declaring a short but only using the values 0 and 1... but it was just a prototype, it worked as intended, and now it is time for me to develop a new, improved version. Performance is important here, but simplicity is also appreciated.
I researched around and came up with:
for the chromosome :
- String class (like "0100100010")
- Array of bool
- Vector (vector<bool> appears to be optimized for bool)
- Bitset (sounds the most natural one)
and for the population:
- C Array[]
- Vector
- Queue
I am inclined to pick vector for the chromosome and array for the population, but I would like the opinion of anyone with experience on the subject.
Thanks in advance!
I'm guessing you want random access to the population and to the genes. You say performance is important, which I interpret as execution speed. So you're probably best off using a vector<> for the chromosomes and a vector<char> for the genes.

The reason for vector<char> is that bitset<> and vector<bool> are optimized for memory consumption, and are therefore slow. vector<char> will give you higher speed at the cost of 8x the memory (assuming char = byte on your system). So if you want speed, go with vector<char>. If memory consumption is paramount, then use vector<bool> or bitset<>.

bitset<> would seem like a natural choice here; however, bear in mind that it is templated on the number of bits, which means that a) the number of genes must be fixed and known at compile time (which I would guess is a big no-no), and b) if you use different sizes, you end up with one copy of each bitset method you use per bitset size (though inlining might negate this), i.e., code bloat. Overall, I would guess vector<bool> is better for you if you don't want vector<char>.
If you're concerned about the aesthetics of vector<char> you could typedef char gene; and then use vector<gene>, which looks more natural.
A string is just like a vector<char> but more cumbersome.
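Put together, that suggestion might look like this (a sketch; the population and chromosome sizes are placeholders):

#include <cstddef>
#include <vector>

typedef char gene;                    // 0 or 1, one byte per gene for speed
typedef std::vector<gene> chromosome; // one individual
typedef std::vector<chromosome> population;

int main()
{
    const std::size_t pop_size = 100;  // placeholder
    const std::size_t num_genes = 64;  // placeholder, can be decided at runtime
    population pop(pop_size, chromosome(num_genes, 0));
    pop[0][5] = 1;                     // flip one gene of the first individual
    return 0;
}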
Specifically, to answer your question: I am not exactly sure what you are suggesting. You talk about arrays and the string class. Are you talking about the STL container classes, where you can have a queue, bitset, vector, linked list, etc.? I would suggest a vector for your population (the closest thing to a C array there is) and a bitset for your chromosome if you are worried about memory capacity. Otherwise, carry on as you are already doing, with a vector of your string representation of your DNA ("10110110").
For ideas and a good tool to dabble with, I recommend you download and initially use this library. It works with the major compilers, runs on Unix variants, and comes with all the source code.
All the framework stuff is done for you and you will learn a lot. Later on you could write your own code from scratch or inherit from these classes. You can also use them in commercial code if you want.
Because they are objects, you can change the representation of your DNA easily from integers to reals to structures to trees to bit arrays, etc.
There is always a learning curve involved, but it is worth it.
I use it to generate thousands of neural nets then weed them out with a simple fitness function then run them for real.
galib
http://lancet.mit.edu/ga/
Assuming that you want to code this yourself (if you want an external library, kingchris seems to have a good one there), it really depends on what kind of manipulation you need to do. To get the most bang for your buck in terms of memory, you could use any integer type and set/manipulate individual bits via bitmasks, etc. This approach is likely not optimal in terms of ease of use, though. The string example above would work OK; however, again it's not significantly different from the shorts; here you are just representing either '0' or '1' with an 8-bit value as opposed to a 16-bit value. Also, again depending on the manipulation, the string case will probably prove unwieldy. So if you could give some more info on the algorithm, we could maybe give more feedback. Myself, I like the individual bits as part of an integer (a bitset), but if you aren't used to masks, shifts, and all that good stuff, it may not be right for you.
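For reference, manipulating individual genes inside an integer with masks and shifts might look like this (a sketch; one unsigned int holds up to 32 genes here, and the index is assumed to be in range):

#include <iostream>

// Treat each bit of 'bits' as one gene.
inline void set_gene(unsigned int& bits, unsigned int i)   { bits |=  (1u << i); }
inline void clear_gene(unsigned int& bits, unsigned int i) { bits &= ~(1u << i); }
inline void flip_gene(unsigned int& bits, unsigned int i)  { bits ^=  (1u << i); }
inline bool test_gene(unsigned int bits, unsigned int i)   { return (bits >> i) & 1u; }

int main()
{
    unsigned int chromosome = 0;
    set_gene(chromosome, 3);
    flip_gene(chromosome, 5);
    std::cout << test_gene(chromosome, 3) << std::endl; // prints 1
    return 0;
}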
I suggest writing a class for each member of the population; that simplifies things considerably, since you can keep all your member-relevant functions in the same place, nicely wrapped with the actual data.
If you need an "array of bools", I suggest using an int or several ints (then use masks and bitwise operations to access (modify / flip) each bit), depending on how many bits your chromosomes need.
I usually used some sort of collection class for the population, because just an array of population members doesn't allow you to simply add to your population. I would suggest implementing some sort of dynamic list (if you are familiar with ArrayList then that is a good example).
I had major success with genetic algorithms with the recipe above. If you prepare your member class properly it can really simplify things and allows you to focus on coding better genetic algorithms instead of worrying about your data structures.
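A minimal sketch of that kind of member class (the names and the empty fitness function are placeholders):

#include <cstddef>
#include <cstdlib>
#include <vector>

class Member
{
    std::vector<char> genes_; // the chromosome, one byte per gene
    double fitness_;
public:
    explicit Member(std::size_t n) : genes_(n, 0), fitness_(0.0) {}

    void randomize() // gene-related operations live next to the data they touch
    {
        for (std::size_t i = 0; i < genes_.size(); ++i)
            genes_[i] = static_cast<char>(std::rand() % 2);
    }
    void evaluate() { fitness_ = 0.0; /* placeholder fitness function */ }
    double fitness() const { return fitness_; }
};

// The population is then simply a growable collection of members:
// std::vector<Member> population(100, Member(64));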