Decreasing the overall computation time - C++

So I have a computationally heavy c++ function that extracts numbers from a file and puts them into a vector. When I run this function in main, it takes a lot of time. Is it possible to somehow have this function computed once, and then linked to the main program so I can save precious computation time in my main program every time I try to run it?
The function I have is this:
vector<double> extract(vector<double> foo)
{
    ifstream wlm;
    wlm.open("wlm.dat");
    if (wlm.is_open())
    {
        while (!wlm.eof())
        {
            // blah blah extraction stuff
        }
        return foo;
    }
    else
        cout << "File didn't open" << endl;
    wlm.close();
}
And my main program has other stuff which I compute over there. I don't want to call this function from the main program because it will take a long time. Instead I want the vector to be extracted beforehand during compile time so I can use the extracted vector later in my main program. Is this possible?

Change your function to this:
std::vector<double>& extract(std::vector<double>& foo)
This way you won't copy the vector twice (I'd guess that's where most of the time goes).
Try to reserve() memory for your vector according to the file data (if that is possible, it will let you avoid reallocations).
You should always return the std::vector<double>, not just on success.
You should close the file only if it was successfully opened.
Something like this:
std::vector<double>& extract(std::vector<double>& foo)
{
    ifstream wlm;
    wlm.open("wlm.dat");
    if (wlm.is_open())
    {
        while (!wlm.eof())
        {
            // blah blah extraction stuff
        }
        wlm.close();
    }
    else
        cout << "File didn't open" << endl;
    return foo;
}
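Putting the reserve() suggestion together with the reference-passing change, a minimal sketch could look like the following. Note two assumptions not in the answer above: the eof() loop is replaced with reading via >> (which stops cleanly at end of file), and the initial reserve size of 1000 is an arbitrary placeholder you'd tune to your data:

```cpp
#include <fstream>
#include <iostream>
#include <vector>

// Fills `foo` with the doubles found in wlm.dat and returns the same
// vector by reference, so nothing is copied.
std::vector<double>& extract(std::vector<double>& foo)
{
    std::ifstream wlm("wlm.dat");
    if (wlm.is_open())
    {
        foo.reserve(1000);   // placeholder capacity; adjust to your file
        double value;
        while (wlm >> value) // stops at end of file or first bad token
            foo.push_back(value);
    }
    else
        std::cout << "File didn't open" << std::endl;
    return foo;              // the stream is closed by its destructor
}
```

The stream is closed automatically when wlm goes out of scope, so no explicit close() call is needed on either path.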

While your question was not entirely clear, I assume that you want to:
compute a vector of doubles from a large set of data
use this computed (smaller) set of data in your program
do the computation at compile time
This is possible of course, but you will have to leverage whatever build system you are using. Without more specifics, I can only give a general answer:
Create a helper program that you can invoke during compilation. This program should implement the extract function and dump the result into a file. You have two main choices here: go for a resource file that can be embedded into the executable, or generate source code that contains the data. If the data is not terribly large, I suggest the latter.
Use the generated file in your program
For example:
Pre-build step: extract_data.exe extracted_data_generated
This dumps the extracted data into a header and source, such as:
// extracted_data_generated.h
#pragma once
#include <array>
extern const std::array<double, 4> extracted;

// extracted_data_generated.cpp
#include "extracted_data_generated.h"
const std::array<double, 4> extracted{ { 1.2, 3.4, 5.6, 6.7 } }; // etc.
In other parts of your program, use the generated data
#include "extracted_data_generated.h"
// you have extracted available as a variable here.
I also changed the type to a std::array; you will know its size in your helper program, because there you will know the size of the vector.
The resource route is similar, but you will have to implement platform-specific extraction of the resource and reading the data. So unless your computed data is very large, I'd suggest the code generation.
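A sketch of what such a helper program's core could look like. The file names, the simple whitespace-separated format of wlm.dat, and the generate() entry point are all placeholder assumptions; your real extraction logic goes where the reading loop is:

```cpp
#include <fstream>
#include <vector>

// Reads doubles from `datafile` and writes them out as a compilable
// header/source pair. Invoke this from a pre-build step.
void generate(const char* datafile)
{
    std::ifstream in(datafile);
    std::vector<double> values;
    double v;
    while (in >> v)          // stand-in for the real extraction logic
        values.push_back(v);

    std::ofstream h("extracted_data_generated.h");
    h << "#pragma once\n"
      << "#include <array>\n"
      << "extern const std::array<double, " << values.size() << "> extracted;\n";

    std::ofstream cpp("extracted_data_generated.cpp");
    cpp << "#include \"extracted_data_generated.h\"\n"
        << "const std::array<double, " << values.size() << "> extracted{ {";
    for (double d : values)
        cpp << d << ", ";
    cpp << "} };\n";
}
```

The generated pair can then be compiled into your main program, so the expensive extraction runs once per build rather than once per execution.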

Related

saving object information into a binary file

I'm trying to save all the member variables of an object in a binary file. However, the member variables are vectors that are dynamically allocated. So, is there any way to combine all the data and save it in a binary file? As of now, it just saves the pointer, which is of little help. Following is my running code.
#include <vector>
#include <iostream>
#include <fstream>

class BaseSaveFile {
protected:
    std::vector<float> first_vector;
public:
    void fill_vector(std::vector<float> fill) {
        first_vector = fill;
    }
    void show_vector() {
        for ( auto x: first_vector )
            std::cout << x << std::endl;
    }
};

class DerivedSaveFile : public BaseSaveFile {
};

int main ( int argc, char **argv) {
    DerivedSaveFile derived;
    std::vector<float> fill;
    for ( auto i = 0; i < 10; i++) {
        fill.push_back(i);
    }
    derived.fill_vector(fill);
    derived.show_vector();
    std::ofstream save_object("../save_object.bin", std::ios::out | std::ios::binary);
    save_object.write((char*)&derived, sizeof(derived));
}
Currently the size of the binary file is just 24 bytes, but I was expecting it to be much larger because of the vector of 10 floats.
"is there any way to combine all the data and save it in a binary file" - of course there is. You write code to iterate over all the data and convert it into a form suitable for writing to a file (one that you know how to later parse when reading it back in). Then you write code to read the file, parse it into meaningful variables, and construct new objects from the read-in data. There's no built-in facility for it, but it's not rocket science - just a bunch of work/code you need to do.
It's called serialisation/de-serialisation btw, in case you want to use your preferred search engine to look up more details.
The problem
You can write the exact binary content of an object to a file:
save_object.write((char*)&derived, sizeof(derived));
However, it is not guaranteed that you can read it back into memory with the reverse read operation. This is only possible for a small subset of objects that have a trivially copyable type and do not contain any pointers.
You can verify if your type matches this definition with std::is_trivially_copyable<BaseSaveFile>::value but I can already tell you that it's not because of the vector.
To simplify the formal definition a bit: trivially copyable types are, more or less, the types composed only of other trivially copyable elements and very elementary data types such as int, float, char, or fixed-size arrays.
The solution: introduction to serialization
The general solution, as mentioned in the other response, is called serialization. But for a more tailored answer, here is how it would look.
You would add the following public method to your type:
std::ostream& save(std::ostream& os){
    size_t vsize = first_vector.size();
    os.write((char*)&vsize, sizeof(vsize));
    os.write((char*)first_vector.data(), vsize * sizeof(float));
    return os;
}
This method has access to all the members and can write them to disk. In the case of the vector, you first write down its size (so that you know how big it is when you read the file back later on).
You would then add the reverse method:
std::istream& load(std::istream& is){
    size_t vsize;
    if(is.read((char*)&vsize, sizeof(vsize))) {
        first_vector.resize(vsize);
        is.read((char*)first_vector.data(), vsize * sizeof(float));
    }
    return is;
}
Here the trick is to first read the size of the vector on disk, and then resize the vector before loading it.
Note the use of istream and ostream. This allows you to store the data in a file, but you could use any other kind of stream, such as an in-memory string stream, if you want.
Here is a full example (it uses stringstream because the online service doesn't allow files to be written).
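Putting the two methods together, a self-contained round trip through a stringstream could look like this (same class shape as in the question, trimmed to the relevant parts; the data() accessor is added here only so the result can be inspected):

```cpp
#include <iostream>
#include <sstream>
#include <vector>

class BaseSaveFile {
protected:
    std::vector<float> first_vector;
public:
    void fill_vector(std::vector<float> fill) { first_vector = fill; }

    // write the element count first, then the raw floats
    std::ostream& save(std::ostream& os) {
        size_t vsize = first_vector.size();
        os.write((char*)&vsize, sizeof(vsize));
        os.write((char*)first_vector.data(), vsize * sizeof(float));
        return os;
    }

    // read the count first, resize, then read the raw floats back
    std::istream& load(std::istream& is) {
        size_t vsize;
        if (is.read((char*)&vsize, sizeof(vsize))) {
            first_vector.resize(vsize);
            is.read((char*)first_vector.data(), vsize * sizeof(float));
        }
        return is;
    }

    // accessor added for this demo only
    const std::vector<float>& data() const { return first_vector; }
};
```

To use it with a real file, pass a std::ofstream opened in binary mode to save() and a matching std::ifstream to load(); the stringstream and the file behave identically here.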
More serialization ?
There are some serialization tricks to know. First, if you have derived types, you'd need to make load() and save() virtual and provide the derived types with their own overridden version.
If one of your data member is not trivially copyable, it would need its own load() and save() that you could then invoke recursively. Or you'd need to handle the thing yourself, which is only possible if you can access all the members you'd need to restore its state.
Finally, you don't need to reinvent the wheel. There are libraries out there that may help, like Boost.Serialization or cereal.

C++ - Efficient way to group double vectors following a certain criteria

I have a list of objects saved in a CSV-like file using the following scheme:
[value11],...,[value1n],[label1]
[value21],...,[value2n],[label2]
...
[valuen1],...,[valuenn],[labeln]
(each line is a single object, i.e. a vector of doubles and the respective label).
I would like to collect them in groups according to a certain custom criterion (i.e. same values at the n-th and (n+1)-th positions of all objects in that group). And I need to do that in the most efficient way, since the text file contains hundreds of thousands of objects. I'm using the C++ programming language.
To do so, I first load all the CSV lines into a simple custom container (with getObject, getLabel and import methods). Then I use the following code to read them and make groups. verifyGroupRequirements is a function which returns true if the group conditions are satisfied, false otherwise.
for (size_t i = 0; i < ObjectsList.getSize(); ++i) {
    MyObject currentObj;
    currentObj.attributes = ObjectsList.getObject(i);
    currentObj.label = ObjectsList.getLabel(i);
    if (i == 0) {
        // Sequence initialization with the first object
        ObjectsGroup currentGroup = ObjectsGroup();
        currentGroup.objectsList.push_back(currentObj);
        tmpGroupList.push_back(currentGroup);
    } else {
        // if it is not the first pattern, then we check sequence conditions
        list<ObjectsGroup>::iterator it5;
        for (it5 = tmpGroupList.begin(); it5 != tmpGroupList.end(); ++it5) {
            bool AddObjectToGroupRequirements =
                verifyGroupRequirements(it5->objectsList.back(), currentObj) &
                ( (it5->objectsList.size() < maxNumberOfObjectsPerGroup) |
                  (maxNumberOfObjectsPerGroup == 0) );
            if (AddObjectToGroupRequirements) {
                // Object added to the group
                it5->objectsList.push_back(currentObj);
                break;
            } else {
                // If we can't find a group which satisfy those conditions and we
                // arrived at the end of the list of groups, then we create a new
                // group with that object.
                size_t gg = std::distance(it5, tmpGroupList.end());
                if (gg == 1) {
                    ObjectsGroup tmp1 = ObjectsGroup();
                    tmp1.objectsList.push_back(currentObj);
                    tmpGroupList.push_back(tmp1);
                    break;
                }
            }
        }
    }
    if (maxNumberOfObjectsPerGroup > 0) {
        // With a for loop we can take all the elements of
        // tmpGroupList which have reached the maximum size
        list<ObjectsGroup>::iterator it2;
        for (it2 = tmpGroupList.begin(); it2 != tmpGroupList.end(); ++it2) {
            if (it2->objectsList.size() == maxNumberOfObjectsPerGroup)
                finalGroupList.push_back(*it2);
        }
        // Since tmpGroupList is a list we can use remove_if to remove them
        tmpGroupList.remove_if(rmCondition);
    }
}
if (maxNumberOfObjectsPerGroup == 0)
    finalGroupList = vector<ObjectsGroup>(tmpGroupList.begin(), tmpGroupList.end());
else {
    list<ObjectsGroup>::iterator it6;
    for (it6 = tmpGroupList.begin(); it6 != tmpGroupList.end(); ++it6)
        finalGroupList.push_back(*it6);
}
Where tmpGroupList is a list<ObjectsGroup>, finalGroupList is a vector<ObjectsGroup>, and rmCondition is a boolean function that returns true if the size of an ObjectsGroup is bigger than a fixed value. MyObject and ObjectsGroup are two simple data structures, written in the following way:
// Data structure of the single object
class MyObject {
public:
    MyObject(
        unsigned short int &spaceToReserve,
        double &defaultContent,
        string &lab) {
        attributes = vector<double>(spaceToReserve, defaultContent);
        label = lab;
    }
    vector<double> attributes;
    string label;
};

// Data structure of a group of objects
class ObjectsGroup {
public:
    list<MyObject> objectsList;
    double health;
};
This code seems to work, but it is really slow. Since, as I said before, I have to apply it to a large set of objects, is there a way to improve it and make it faster? Thanks.
[EDIT] What I'm trying to achieve is to make groups of objects, where each object is a vector<double> (taken from a CSV file). So what I'm asking is: is there a more efficient way to collect these kinds of objects into groups than the one shown in the code example above?
[EDIT2] I need to make groups using all of those vectors.
So, I'm reading your question...
... I would like to collect them in groups with a certain custom
criteria (i.e. same values at n-th and (n+1)-th position of all
objects of that group) ...
Ok, I read this part, and kept on reading...
... And I need to do that in the most efficient way, since the text file
contains hundreds of thousands of objects...
I'm still with you, makes perfect sense.
... To do so, firstly I load all the CSV lines ...
{thud} {crash} {loud explosive noises}
Ok, I stopped reading right there, and didn't pay much attention to the rest of the question, including the large code sample. This is because we have a basic problem right from the start:
1) You say that your intention is, typically, to read only a small
portion of this huge CSV file, and...
2) ... to do that you load the entire CSV file, into a fairly sophisticated data structure.
These two statements are at odds with each other. You're reading a huge number of values from a file. You are creating an object for each value. Based on the premise of your question, you're going to have a large number of these objects. But then, when all is said and done, you're only going to look at a small number of them, and throw the rest away?
You are doing a lot of work, presumably using up a lot of memory, and CPU cycles, loading a huge data set, only to ignore most of it. And you are wondering why you're having performance issues? Seems pretty cut and dry to me.
What would be an alternative way of doing this? Well, let's turn this whole problem inside out, and approach it piecemeal. Let's read a CSV file, one line at a time, parse the values in the CSV-formatted file, and pass the resulting strings to a lambda.
Something like this:
template<typename Callback> void parse_csv_lines(std::ifstream &i,
                                                 Callback &&callback)
{
    std::string line;

    while (1)
    {
        line.clear();
        std::getline(i, line);

        // Deal with missing newline on the last line...
        if (i.eof() && line.empty())
            break;

        std::vector<std::string> words;

        // At this point, you'll take this "line", and split it apart, at
        // the commas, into the individual words. Parsing a CSV-
        // formatted file. Not very exciting, you're doing this
        // already, the algorithm is boring to implement, you know
        // how to do it, so let's say you replace this entire comment
        // with your boiler-plate CSV parsing logic from your existing
        // code

        callback(words);
    }
}
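The boiler-plate splitting step elided in the comment above could be sketched like this; a deliberately minimal version that splits on every comma and does not handle quoted fields or escaped commas:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line at the commas. Quoting and escaping are not
// handled; a trailing empty field is also dropped in this sketch.
std::vector<std::string> split_csv_line(const std::string& line)
{
    std::vector<std::string> words;
    std::istringstream s(line);
    std::string word;
    while (std::getline(s, word, ','))
        words.push_back(word);
    return words;
}
```

Inside parse_csv_lines you would simply replace the long comment with `words = split_csv_line(line);`.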
Ok, now we've done that task of parsing the CSV file. Now, let's say we want to do the task you've set out in the beginning of your question, grab every nth and n+1th position. So...
void do_something_with_n_and_nplus1_words(size_t n)
{
    std::ifstream input_file("input_file.csv");

    // Insert code to check if input_file.is_open(), and if not, do
    // whatever

    parse_csv_lines(input_file,
                    [n]
                    (const auto &words)
                    {
                        // So now, grab words[n] and words[n+1]
                        // (after checking, of course, for a malformed
                        // CSV file with fewer than "n+2" values)
                        // and do whatever you want with them.
                    });
}
That's it. Now you end up simply reading the CSV file and doing the absolute minimum amount of work required to extract the nth and n+1th values from each line. It's going to be fairly difficult to come up with an approach that does less work, I'd think (except, of course, for micro-optimizations related to CSV parsing and word buffers; or perhaps foregoing the overhead of std::ifstream altogether and mmap-ing the entire file, then parsing it by scanning its mmap-ed contents, something like that).
For other similar one-off tasks, requiring only a small number of values from the CSV files, just write an appropriate lambda to fetch them out.
Perhaps you need to retrieve two or more subsets of values from the large CSV file, and you want to read the CSV file only once? Well, it's hard to give the best general approach; each of these situations requires individual analysis to pick one.

Read doubles out of .txt file into double array c++

I'm attempting to create a program that takes a large number of stock prices. I have these prices stored in a .txt file with one double per row. There are an unknown number of them (probably thousands). I cannot get the data into an array which I can manipulate. I've been unable to solve this problem for a couple of hours now. Whenever I attempt to read the data out of the file and then convert it to a double, I get weird thread errors and the program hangs. Can anyone show me how to read an unknown number of doubles into an array?
string line;
vector<double> doubles;
fstream myfile("/Users/jaychinnaswamy/Documents/PMHistory.txt", std::ios_base::in);
int x = 0;
float a;
while (myfile >> a)
{
    doubles[x] = a;
}
An example of the file structure is:
50.4000000000000
50.8000000000000
50.5000000000000
50.2100000000000
49.1500000000000
48.5000000000000
Thanks
You have created an empty vector here.
vector<double> doubles;
And here, you are indexing into the vector, as if it's not empty. It's still empty and you are accessing an invalid element.
doubles[x]=a;
Change that code to use std::vector::push_back().
doubles.push_back(a);
Two things obvious from the code I see (already mentioned in comments):
You don't increment x.
Even without the increment, you access doubles[x] with x==0 when doubles has no elements. That's undefined behavior. You need to allocate elements ahead of time, or use push_back to grow the vector as needed.
The code you have submitted has a few issues. As stated by others, you have instantiated an empty std::vector, and your code assigns to an index that does not exist. Since your problem revolves around an unknown number of elements (dynamic size), use the push_back method of std::vector. I would also advise you not to read into a float and convert to double through a: you know the data is a double and you want to store it as a double, so use double. Since this is just a piece of your code, you can use a for loop to limit the scope of a to the loading of the data.
Your code can be rewritten into:
std::ifstream myfile("/Users/jaychinnaswamy/Documents/PMHistory.txt");
std::vector<double> doubles;
for (double a; myfile >> a;) {
    doubles.push_back(a);
}
However, if you don't require data verification, you know the file contains only doubles, and you really just want to import the numbers, then use the range constructor for std::vector to read the file for you via std::istream_iterator.
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream myfile("/Users/jaychinnaswamy/Documents/PMHistory.txt");
    std::vector<double> doubles{(std::istream_iterator<double>(myfile)),
                                std::istream_iterator<double>()};
}
Pay special attention to the () around the first iterator argument of the std::vector range constructor. Without the (), the declaration can become syntactically ambiguous, a problem known as the "most vexing parse".

C++ newbie: writing a function for extracting i vectors from a file. How do I get i/unkown number of vectors out of the function?

I am writing a function for getting datasets from a file and putting them into vectors. The datasets are then used in a calculation. In the file, a user writes each dataset on a line under a heading like 'Dataset1'. The result is i vectors by the time the function finishes executing. The function works just fine.
The problem is that I don't know how to get the vectors out of the function! (1) I think I can only return one entity from a function, so I can't return i vectors. Also, (2) I can't write the vectors/datasets as function parameters and return them by reference, because the number of vectors/datasets is different for each calculation. If there are other possibilities, I am unaware of them.
I'm sure this is a silly question, but am I missing something here? I would be very grateful for any suggestions. Until now, I have not put the vector/dataset extraction code into a function; I have kept it in my main file, where it has worked fine. I would now like to clean up my code by putting all data extraction code into its own function.
For each calculation, I DO know the number of vectors/datasets that the function will find in the file because I have that information written in the file and can extract it. Is there some way I could use this information?
If each vector is of the same type you can return a
std::vector<std::vector<datatype> >
This would look like:
std::vector<std::vector<datatype> > function(arguments) {
    std::vector<std::vector<datatype> > return_vector;
    for (int i = 0; i < rows; ++i) {
        // do processing
        return_vector.push_back(resulting_vector);
    }
    return return_vector;
}
As has been mentioned, you may simply use a vector of vectors.
In addition, you may want to wrap it in a smart pointer, just to make sure you're not copying the contents of your vectors (but that's already an optimization; first aim at something that works).
As for the information on the number of vectors, you may use it by resizing the global vector to the appropriate value.
Your question is, at its essence, "How do I return a pile of things from a function?" It happens that your things are vector<double>s, but that's not really important. What is important is that you have a pile of them of unknown size.
You can refine your thinking by rephrasing your one question into two:
How do I represent a pile of things?
How do I return that representation from a function?
As to the first question, this is precisely what containers do. Containers, as you surely know because you are already using one, hold an arbitrary numbers of similar objects. Examples include std::vector<T> and std::list<T>, among others. Your choice of which container to use is dictated by circumstances you haven't mentioned -- for example, how expensive are the items to copy, do you need to delete an item from middle of the pile, etc.
In your specific case, knowing what little we know, it seems you should use std::vector<>. As you know, the template parameter is the type of the thing you want to store. In your case that happens to be (coincidentally), an std::vector<double>. (The fact that the container and its contained object happen to be similar types is of no consequence. If you need a pile of Blobs or Widgets, you say std::vector<Blob> or std::vector<Widget>. Since you need a pile of vector<double>s, you say vector<vector<double> >.) So you would declare it thus:
std::vector<std::vector<double > > myPile;
(Notice the space between > and >. That space is required in the previous C++ standard.)
You build up that vector just as you did your vector<double> -- either using generic algorithms, or invoking push_back, or some other way. So, your code would look like this:
void function( /* args */ ) {
    std::vector<std::vector<double> > myPile;
    while( /* some condition */ ) {
        std::vector<double> oneLineOfData;
        /* code to read in one vector */
        myPile.push_back(oneLineOfData);
    }
}
In this manner, you collect all of the incoming data into one structure, myPile.
As to the second question, how to return the data. Well, that's simple -- use a return statement.
std::vector<std::vector<double> > function( /* args */ ) {
    std::vector<std::vector<double> > myPile;
    /* All of the useful code goes here */
    return myPile;
}
Of course, you could also return the information via a passed-in reference to your vector:
void function( /* args */, std::vector<std::vector<double> >& myPile)
{
    /* code goes here, including: */
    myPile.push_back(oneLineOfData);
}
Or via a passed-in pointer to your vector:
void function( /* args */, std::vector<std::vector<double> >* myPile)
{
    /* code goes here. */
    myPile->push_back(oneLineOfData);
}
In both of those cases, the caller must create the vector-of-vector-of-double before invoking your function. Prefer the first (return) way, but if your program design dictates, you can use the other ways.
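Putting the pieces together for the file format described in the question (each dataset on its own line, under a heading line like "Dataset1" - the exact heading text is an assumption here), the extraction function could look like the following. Any line that yields no numbers is treated as a heading and skipped:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Reads datasets from a stream: every line that parses as numbers
// becomes one vector<double>; non-numeric lines (the "Dataset1"-style
// headings, assumed from the question) are skipped.
std::vector<std::vector<double>> read_datasets(std::istream& in)
{
    std::vector<std::vector<double>> datasets;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream s(line);
        std::vector<double> row;
        double v;
        while (s >> v)
            row.push_back(v);
        if (!row.empty())       // heading or blank line: skip
            datasets.push_back(row);
    }
    return datasets;
}
```

If the file records the number of datasets up front, as the asker mentions, a reserve() call on the outer vector with that count would avoid reallocations, but it isn't required for correctness.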

Simulation design - flow of data, coupling

I am writing a simulation and need some hint on the design. The basic idea is that data for the given stochastic processes is being generated and later on consumed for various calculations. For example for 1 iteration:
Process 1 -> generates data for source 1: x1
Process 2 -> generates data for source 1: x2
and so on
Later I want to apply some transformations for example on the output of source 2, which results in x2a, x2b, x2c. So in the end up with the following vector: [x1, x2a, x2b, x2c].
I have a problem, as for N-multivariate stochastic processes (representing for example multiple correlated phenomenons) I have to generate N dimensional sample at once:
Process 1 -> generates data for source 1...N: x1...xN
I am thinking about a simple architecture that would allow me to structure the simulation code and provide flexibility without hindering performance.
I was thinking of something along these lines (pseudocode):
class random_process
{
    // concrete processes would generate and store last data
    virtual data_ptr operator()() const = 0;
};

class source_proxy
{
    container_type<process> processes;
    container_type<data_ptr> data; // pointers to the process data storage
    data operator[](size_type number) const { return *(data[number]); }
    void next() const { /* update the processes */ }
};
Somehow I am not convinced about this design. For example, if I'd like to work with vectors of samples instead of a single iteration, then the above design would have to change (I could, for example, have the processes fill submatrices of a proxy-matrix passed to them with data, but again I'm not sure if this is a good idea; if yes, it would also fit the single-iteration case nicely). Any comments, suggestions and criticism are welcome.
EDIT:
Short summary of the text above to summarize the key points and clarify the situation:
random_processes contain the logic to generate some data. For example it can draw samples from multivariate random gaussian with the given means and correlation matrix. I can use for example Cholesky decomposition - and as a result I'll be getting a set of samples [x1 x2 ... xN]
I can have multiple random_processes, with different dimensionality and parameters
I want to do some transformations on individual elements generated by random_processes
Here is the dataflow diagram
     random_processes                    output
     x1 ------------------------------>  x1
                               |------>  x2a
p1   x2 ----------transform---|------>  x2b
                               |------>  x2c
     x3 ------------------------------>  x3

p2   y1 ----------transform---|------>  y1a
                               |------>  y1b
The output is being used to do some calculations.
When I read this, "the answer" doesn't materialize in my mind; instead, a question does:
(This problem is part of a class of problems that various tool vendors in the market have created configurable solutions for.)
Do you "have to" write this or can you invest in tried and proven technology to make your life easier?
In my job at Microsoft I work with high performance computing vendors - several of which have math libraries. Folks at these companies would come much closer to understanding the question than I do. :)
Cheers,
Greg Oliver [MSFT]
I'll take a stab at this; perhaps I'm missing something, but it sounds like we have a list of processes 1...N that don't take any arguments and return a data_ptr. So why not store them in a vector (or array) if the number is known at compile time, and then structure them in whatever way makes sense? You can get really far with the STL and the built-in containers (std::vector), function objects (std::tr1::function) and algorithms (std::transform). You didn't say much about the higher-level structure, so I'm assuming a really naive one, but clearly you would build the data flow appropriately. It gets even easier if you have a compiler with support for C++0x lambdas, because you can nest the transformations more easily.
// compiled in the SO textbox...
#include <vector>
#include <functional>
#include <algorithm>

typedef int data_ptr;

class Generator {
public:
    data_ptr operator()() {
        // randomly generate input
        return 42 * 4;
    }
};

class StochasticTransformation {
public:
    data_ptr operator()(data_ptr in) {
        // apply a randomly seeded function
        return in * 4;
    }
};

int main() {
    // array of processes; wrap this in a class if you like, but it sounds
    // like there is a distinction between generators that create data
    // and transformations
    std::vector<std::tr1::function<data_ptr(void)> > generators;

    // TODO: fill up the process vector with functors...
    generators.push_back(Generator());

    // transformations look like this (right?)
    std::vector<std::tr1::function<data_ptr(data_ptr)> > transformations;

    // so let's add one
    transformations.push_back(StochasticTransformation());

    // and we have an array of results...
    std::vector<data_ptr> results;

    // and we need some inputs
    const int NUMBER = 10;
    for (int i = 0; i < NUMBER; ++i)
        results.push_back(generators[0]());

    // and now start transforming them using transform...
    // pick a random one or do them all...
    std::transform(results.begin(), results.end(),
                   results.begin(), transformations[0]);
}
I think that the second option (the one mentioned in the last paragraph) makes more sense. In the one you presented, you are playing with pointers and indirect access to the random process data. The other one would store all the data (either a vector or a matrix) in one place, the source_proxy object. The random process objects are then called with a submatrix to populate as a parameter, and they themselves do not store any data. The proxy manages everything, from providing the source data (for any distinct source) to requesting new data from the generators.
So changing a bit your snippet we could end up with something like this:
class random_process
{
    // concrete processes would generate data into the given submatrix
    virtual void operator()(submatrix &) = 0;
};

class source_proxy
{
    container_type<random_process> processes;
    matrix data;
    data operator[](size_type source_number) const { /* return a column of data */ }
    void next() { /* get new data from the random processes */ }
};
But I agree with the other comment (Greg's) that it is a difficult problem, and depending on the final application it may require heavy thinking. It's easy to go down a dead end and end up rewriting lots of code...
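As a rough concrete sketch of this second design (all names are invented for illustration, and a "submatrix" is simplified here to the set of columns a process owns inside the proxy; C++11 is assumed):

```cpp
#include <memory>
#include <vector>

using column = std::vector<double>;

// A process fills the columns it owns with one new sample each.
class random_process {
public:
    virtual ~random_process() = default;
    virtual void operator()(std::vector<column*>& cols) = 0;
};

// Illustration only: a "process" that always writes a constant.
class constant_process : public random_process {
    double value_;
public:
    explicit constant_process(double v) : value_(v) {}
    void operator()(std::vector<column*>& cols) override {
        for (column* c : cols)
            c->push_back(value_);
    }
};

// The proxy owns all the data; processes store nothing themselves.
class source_proxy {
    std::vector<std::unique_ptr<random_process>> processes_;
    std::vector<column> data_;                // one column per source
    std::vector<std::vector<size_t>> owned_;  // columns each process fills
public:
    void add_process(std::unique_ptr<random_process> p,
                     std::vector<size_t> columns) {
        processes_.push_back(std::move(p));
        owned_.push_back(std::move(columns));
        size_t need = 0;
        for (size_t c : owned_.back())
            if (c + 1 > need) need = c + 1;
        if (data_.size() < need) data_.resize(need);
    }
    const column& operator[](size_t source) const { return data_[source]; }
    void next() {   // one iteration: every process appends a sample
        for (size_t i = 0; i < processes_.size(); ++i) {
            std::vector<column*> cols;
            for (size_t c : owned_[i])
                cols.push_back(&data_[c]);
            (*processes_[i])(cols);
        }
    }
};
```

A transform step would then be another object reading one column of the proxy and writing its outputs to further columns; the columns-as-vectors layout also covers the "vectors of samples" case, since each call to next() appends one row.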