C++ efficiently extracting subsets from vector of user defined structure

C++ efficiently extracting subsets from vector of user defined structure - c++

Let me preface this with the statement that most of my background has been with functional programming languages so I'm fairly novice with C++.
Anyhow, the problem I'm working on is that I'm parsing a csv file with multiple variable types. A sample line from the data looks as such:
"2011-04-14 16:00:00, X, 1314.52, P, 812.1, 812"
"2011-04-14 16:01:00, X, 1316.32, P, 813.2, 813.1"
"2011-04-14 16:02:00, X, 1315.23, C, 811.2, 811.1"
So what I've done is defined a struct which stores each line. Then each of these are stored in a std::vector< mystruct >. Now say I want to subset this vector by column 4 into two vectors where every element with P in it is in one and C in the other.
Now the example I gave is fairly simplified, but the actual problem involves subsetting multiple times.
My initial naive implementation was iterate through the entire vector, creating individual subsets defined by new vectors, then subsetting those newly created vectors. Maybe something a bit more memory efficient would be to create an index, which would then be shrunk down.
Now my question is, is there a more efficient, in terms of speed/memory usage) way to do this either by this std::vector< mystruct > framework or if there's some better data structure to handle this type of thing.
Thanks!
EDIT:
Basically the output I'd like is first two lines and last line separately. Another thing worth noting, is that typically the dataset is not ordered like the example, so the Cs and Ps are not grouped together.

I've used std::partition for this. It's not part of boost though.

If you want a data structure that allows you to move elements between different instances cheaply, the data structure you are looking for is std::list<> and it's splice() family of functions.

I understand you have not per se trouble doing this but you seem to be concerned about memory usage and performance.
Depending on the size of your struct and the number of entries in the csv file it may be advisabe to use a smart pointer if you don't need to modify the partitioned data so the mystruct objects are not copied:
typedef std::vector<boost::shared_ptr<mystruct> > table_t;
table_t cvs_data;
If you use std::partition (as another poster suggested) you need to define a predicate that takes the indirection of the shared_ptr into accont.

Related

“2D table with headers” type container

Consider the following data structure:
class Column {
std::vector<Cell> cells; // in my case, usually 1 to 10 elements
Header header; // some additional information
};
std::vector<Column> table;
// important: we now that all Columns have the same number of cells.
// you could also think of it as:
std::vector<std::pair<Header, std::vector<Cell>>> table_with_pairs;
Of course, this is fairly inefficient in terms of memory usage, and for certain operations (I am thinking of random access, or iterating over smaller parts of the table). Is there a ready-made container (probably non-STL) that could handle this “2D table with headers” better? Or a common practice? The goal would be a high-performance, easy-to-maintain and readable solution.
This problem seems related to the well-known “2D array” task but we have the additional header information. One solution I can think of is to store the headers in a std::vector parallel to a 2D array (pick your favourite solution) but that’s not easy to maintain, might have bad locality, and is against the principles of OOP. I’d prefer a memory layout similar to a std::vector<std::pair<Header, std::array<Cell,n>>> with n not known at compile time but constant. Determining the address of a cell (or header) would be nearly as trivial as in a 2D array but I thought I’d ask before coding my own solution.
[Edit] To clarify, I am not thinking about things like Excel sheets here. I could have called it “vector of objects that contain a vector and other data”.

C++ mutable multitype array

I am quite new to C++ but am familiar with a few other languages.
I would like to use a data type similar to a Java ArrayList, an Objective-c NSMutableArray or a Python array, but in C++. The characteristics I am looking for are the possibility to initialize the array without a capacity (thus to be able to add items gradually), and the capability to store multiple datatypes in one array.
To give you details of what I want it for is to read data from different tables of a mysql db, without knowing the number of fields in the table, and being able to move this data around. My ideal data type would allow me to store something like this:
idealArray = [Bool error,[string array],[string array]];
where the string arrays may have different sizes, from 1 to 20 in size (relatively small).
I don't know if this is possible in C++, any help appreciated, or links towards good ressources.
Thanks

The standard dynamically sized array in C++ is std::vector<>. Homogeneous containers doesn't exist unless you introduce indirection, for this you can use either boost::variant or boost::any depending on your needs.

You may use structure or class to store (named) multiple data types together, such as:
class Record
{
bool _error;
vector<string> _v1;
vector<string> _v2;
};
vector<Record> vec;
or std::tuple to store (unnamed) multiple data types, e.g.
vector<tuple<bool, vector<string>, vector<string> > > vec;

I would suggest using a std::vector from the STL. However, note that C++ does not have a container that can contain multiple data types. There are multiple ways to simulate this behaviour though:
Derive all the "types" that you want to store from a certain "Base_type". Then store the items in the vector as std::vector<Base_type*>, but then you would need to know which item is where and the type to (dynamic)cast to, if the "types" are totally different.
Use something like std::vector<boost::any> from the boost library (but noticing that you are new to C++, that might be overkill).
In fact, the question you need to ask is, why do you want to store unrelated "types" in an "array" in the first place? and if they are related, then "how"? this will guide you in desgining a decent "Base_type" for the "types".
And finally, in short, C++ does not have homogenous array-like structures that can contain unrelated data types.

You could try to use an std::vector<boost::any> (documentation here).

C++: Implementing Row Iterator for Table

I have a generic table class implemented in C++ that uses a shared_ptr< ptr_vector< vector<T> > > as its backing, where T is an arbitrary typename; the ptr_vector contains pointers to the vectors corresponding to the columns in the table. I decided to wrap the ptr_vector in a shared_ptr since the tables may contain many millions of rows, and the vectors containing data for each column in a ptr_vector for the same reason. (Please tell me if this can be improved.)
Implementing column-wise operations on this table is trivial, since I have access to the native iterator supplied by the vector. However, I also need the table to support row-wise operations: relatively mundane operations such as adding and removing rows should be supported, as well as the ability to use the STL algorithms with the table. Now, I have run across some design issues that I need some help to address:
It seems that implementing a custom iterator to conduct row-wise operations is necessary to accomplish what is describe above. Would boost::iterator_adaptor be the right way to go about doing this?
When the user adds rows to the table, I do not wish to impose a specific data structure upon the user -- how would I go about doing this? I am thinking of accepting iterators as parameters to the add_row() method.
If you think that I should be implementing this table structure differently, I am open to any suggestions that you may have for me. It was originally designed with the intent to store strings read from tab-delimited files containing hundreds of thousands of row entries.
Thank you very much for your help!

The Boost library has a container called multi_array which provides and n-dimensional dynamic array which can be accessed and manipulated along each dimension. This seems to be very similar to what you are trying to build, perhaps you could use it instead?

Boost.MultiArray Beginner: How to get a 4D-Array with dynamic inner-array-sizes?

i want to store some kind of distance-matrix (2D), where each entry has some alternatives (different coordinates). So i want to access the distance for example x=1 with x_alt=3 and y=3 with y_alt=1, looking in a 4-dim multi-array with array[1][3][3][1].
The important thing to notice is the following: the 2 most inner arrays/vectors don't have the same size for different values of the outer ones.
After a first init step, where i calculate the values, no more modifying is needed!
This should be easily possible with the use of stl-vectors:
vector<vector<vector<vector<double> > > >`extended_distance_matrix;
where i can dynamically iterate over the outer 2 dimensions and fill only as much alternatives to the inner 2 dimensions as i need (e.g. with push_back()).
Questions:
Is this kind of data-structure definition possible with Boost.MultiArray? How?
Is it a good idea to use Boost.MultiArray instead of the nested vectors? Performance (especially lookups! (Memory-layout))? Easy of use?
Thanks for any input!
sascha
PS: The boost documentation didn't help me. Maybe one can use multi_array_ref to get already sized arrays into the whole 4D-structure?
Edit:
At the moment i'm thinking of another approach: flattening the alternatives -> one bigger matrix with all the distances between the alternatives. Then i only need to calc the number of alternatives per node, build up the prefix sum (which is describing the matrix position/shift) and can then access the information in a 2-step-way.
But my questions are still open.

it sounds like you need:
multi_array<ublas::matrix<type>,2>

Boost.MultiArray deals with contiguous memory (arranged logically in many dimensions) so it is difficult to add elements in the inner dimensions. MultiArrays can be dynamically resized, e.g. to add elements in any dimension, but this is a costly operation that almost certainly need (internally) reallocation and copying.
Because of that requirement MultiArray is not the best option. But from what you say it looks like a combination of the two would be appropriate to you.
boost::multi_array<std::vector<std::vector<type>>, 2> data
The very nice thing is that the indexing interface doesn't change with respect to boost::multi_array<type, 4>. For example data[1][2][3][4] still makes sense.
I don't know from your post how you handle the inner dimension but it could even make sense to use this:
boost::multi_array<boost::multi_array<type>, 2>, 2> data
In any case, unless you really need to do linear algebra I would stay away from boost::ublas::array, or at most use it for the internal array if type is numeric. boost::multi_array<boost::ublas::array<type>, 2> data which is mentioned in the other answer.

Choice of the most performant container (array)

This is my little big question about containers, in particular, arrays.
I am writing a physics code that mainly manipulates a big (> 1 000 000) set of "particles" (with 6 double coordinates each). I am looking for the best way (in term of performance) to implement a class that will contain a container for these data and that will provide manipulation primitives for these data (e.g. instantiation, operator[], etc.).
There are a few restrictions on how this set is used:
its size is read from a configuration file and won't change during execution
it can be viewed as a big two dimensional array of N (e.g. 1 000 000) lines and 6 columns (each one storing the coordinate in one dimension)
the array is manipulated in a big loop, each "particle / line" is accessed and computation takes place with its coordinates, and the results are stored back for this particle, and so on for each particle, and so on for each iteration of the big loop.
no new elements are added or deleted during the execution
First conclusion, as the access on the elements is essentially done by accessing each element one by one with [], I think that I should use a normal dynamic array.
I have explored a few things, and I would like to have your opinion on the one that can give me the best performances.
As I understand there is no advantage to use a dynamically allocated array instead of a std::vector, so things like double** array2d = new ..., loop of new, etc are ruled out.
So is it a good idea to use std::vector<double> ?
If I use a std::vector, should I create a two dimensional array like std::vector<std::vector<double> > my_array that can be indexed like my_array[i][j], or is it a bad idea and it would be better to use std::vector<double> other_array and acces it with other_array[6*i+j].
Maybe this can gives better performance, especially as the number of columns is fixed and known from the beginning.
If you think that this is the best option, would it be possible to wrap this vector in a way that it can be accessed with a index operator defined as other_array[i,j] // same as other_array[6*i+j] without overhead (like function call at each access) ?
Another option, the one that I am using so far is to use Blitz, in particular blitz::Array:
typedef blitz::Array<double,TWO_DIMENSIONS> store_t;
store_t my_store;
Where my elements are accessed like that: my_store(line, column);.
I think there are not much advantage to use Blitz in my case because I am accessing each element one by one and that Blitz would be interesting if I was using operations directly on array (like matrix multiplication) which I am not.
Do you think that Blitz is OK, or is it useless in my case ?
These are the possibilities I have considered so far, but maybe the best one I still another one, so don't hesitate to suggest me other things.
Thanks a lot for your help on this problem !
Edit:
From the very interesting answers and comments bellow a good solution seems to be the following:
Use a structure particle (containing 6 doubles) or a static array of 6 doubles (this avoid the use of two dimensional dynamic arrays)
Use a vector or a deque of this particle structure or array. It is then good to traverse them with iterators, and that will allow to change from one to another later.
In addition I can also use a Blitz::TinyVector<double,6> instead of a structure.

So is it a good idea to use std::vector<double> ?
Usually, a std::vector should be the first choice of container. You could use either std::vector<>::reserve() or std::vector<>::resize() to avoid reallocations while populating the vector. Whether any other container is better can be found by measuring. And only by measuring. But first measure whether anything the container is involved in (populating, accessing elements) is worth optimizing at all.
If I use a std::vector, should I create a two dimensional array like std::vector<std::vector<double> > [...]?
No. IIUC, you are accessing your data per particle, not per row. If that's the case, why not use a std::vector<particle>, where particle is a struct holding six values? And even if I understood incorrectly, you should rather write a two-dimensional wrapper around a one-dimensional container. Then align your data either in rows or columns - what ever is faster with your access patterns.
Do you think that Blitz is OK, or is it useless in my case?
I have no practical knowledge about blitz++ and the areas it is used in. But isn't blitz++ all about expression templates to unroll loop operations and optimizing away temporaries when doing matrix manipulations? ICBWT.

First of all, you don't want to scatter the coordinates of one given particle all over the place, so I would begin by writing a simple struct:
struct Particle { /* coords */ };
Then we can make a simple one dimensional array of these Particles.
I would probably use a deque, because that's the default container, but you may wish to try a vector, it's just that 1.000.000 of particles means about a single chunk of a few MBs. It should hold but it might strain your system if this ever grows, while the deque will allocate several chunks.
WARNING:
As Alexandre C remarked, if you go the deque road, refrain from using operator[] and prefer to use iteration style. If you really need random access and it's performance sensitive, the vector should prove faster.

The first rule when choosing from containers is to use std::vector. Then, only after your code is complete and you can actually measure performance, you can try other containers. But stick to vector first. (And use reserve() from the start)
Then, you shouldn't use an std::vector<std::vector<double> >. You know the size of your data: it's 6 doubles. No need for it to be dynamic. It is constant and fixed. You can define a struct to hold you particle members (the six doubles), or you can simply typedef it: typedef double particle[6]. Then, use a vector of particles: std::vector<particle>.
Furthermore, as your program uses the particle data contained in the vector sequentially, you will take advantage of the modern CPU cache read-ahead feature at its best performance.

You could go several ways. But in your case, don't declare astd::vector<std::vector<double> >. You're allocating a vector (and you copy it around) for every 6 doubles. Thats way too costly.

If you think that this is the best option, would it be possible to wrap this vector in a way that it can be accessed with a index operator defined as other_array[i,j] // same as other_array[6*i+j] without overhead (like function call at each access) ?
(other_array[i,j] won't work too well, as i,j employs the comma operator to evaluate the value of "i", then discards that and evaluates and returns "j", so it's equivalent to other_array[i]).
You will need to use one of:
other_array[i][j]
other_array(i, j) // if other_array implements operator()(int, int),
// but std::vector<> et al don't.
other_array[i].identifier // identifier is a member variable
other_array[i].identifier() // member function getting value
other_array[i].identifier(double) // member function setting value
You may or may not prefer to put get_ and set_ or similar on the last two functions should you find them useful, but from your question I think you won't: functions are prefered in APIs between parts of large systems involving many developers, or when the data items may vary and you want the algorithms working on the data to be independent thereof.
So, a good test: if you find yourself writing code like other_array[i][3] where you've decided "3" is the double with the speed in it, and other_array[i][5] because "5" is the the acceleration, then stop doing that and give them proper identifiers so you can say other_array[i].speed and .acceleration. Then other developers can read and understand it, and you're much less likely to make accidental mistakes. On the other hand, if you are iterating over those 6 elements doing exactly the same things to each, then you probably do want Particle to hold a double[6], or to provide an operator[](int). There's no problem doing both:
struct Particle
{
double x[6];
double& speed() { return x[3]; }
double speed() const { return x[3]; }
double& acceleration() { return x[5]; }
...
};
BTW / the reason that vector<vector<double> > may be too costly is that each set of 6 doubles will be allocated on the heap, and for fast allocation and deallocation many heap implementations use fixed-size buckets, so your small request will be rounded up t the next size: that may be a significant overhead. The outside vector will also need to record a extra pointer to that memory. Further, heap allocation and deallocation is relatively slow - in you're case, you'd only be doing it at startup and shutdown, but there's no particular point in making your program slower for no reason. Even more importantly, the areas on the heap may just around in memory, so your operator[] may have cache-faults pulling in more distinct memory pages than necessary, slowing the entire program. Put another way, vectors store elements contiguously, but the pointed-to-vectors may not be contiguous.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js