I have a list of objects saved in a CSV-like file using the following scheme:
[value11],...,[value1n],[label1]
[value21],...,[value2n],[label2]
...
[valuen1],...,[valuenn],[labeln]
(each line is a single object, i.e. a vector of doubles and the respective label).
I would like to collect them in groups according to a certain custom criterion (e.g. same values at the n-th and (n+1)-th positions for all objects of that group). And I need to do that in the most efficient way, since the text file contains hundreds of thousands of objects. I'm using the C++ programming language.
To do so, I first load all the CSV lines into a simple custom container (with getObject, getLabel and import methods). Then I use the following code to read them and build the groups. verifyGroupRequirements is a function which returns true if the group conditions are satisfied, false otherwise.
for (size_t i = 0; i < ObjectsList.getSize(); ++i) {
    MyObject currentObj;
    currentObj.attributes = ObjectsList.getObject(i);
    currentObj.label = ObjectsList.getLabel(i);
    if (i == 0) {
        // Sequence initialization with the first object
        ObjectsGroup currentGroup;
        currentGroup.objectsList.push_back(currentObj);
        tmpGroupList.push_back(currentGroup);
    } else {
        // If it is not the first pattern, then we check the group conditions
        list<ObjectsGroup>::iterator it5;
        for (it5 = tmpGroupList.begin(); it5 != tmpGroupList.end(); ++it5) {
            // Note: use logical (short-circuiting) operators here,
            // not the bitwise & and |
            bool AddObjectToGroupRequirements =
                verifyGroupRequirements(it5->objectsList.back(), currentObj) &&
                ((it5->objectsList.size() < maxNumberOfObjectsPerGroup) ||
                 (maxNumberOfObjectsPerGroup == 0));
            if (AddObjectToGroupRequirements) {
                // Object added to the group
                it5->objectsList.push_back(currentObj);
                break;
            } else {
                // If we can't find a group which satisfies those conditions
                // and we arrived at the end of the list of groups, then we
                // create a new group with that object.
                // (std::distance on a list is O(n); checking
                // std::next(it5) == tmpGroupList.end() would be cheaper.)
                size_t gg = std::distance(it5, tmpGroupList.end());
                if (gg == 1) {
                    ObjectsGroup tmp1;
                    tmp1.objectsList.push_back(currentObj);
                    tmpGroupList.push_back(tmp1);
                    break;
                }
            }
        }
    }
    if (maxNumberOfObjectsPerGroup > 0) {
        // With a for loop we can take all the elements of
        // tmpGroupList which have reached the maximum size
        list<ObjectsGroup>::iterator it2;
        for (it2 = tmpGroupList.begin(); it2 != tmpGroupList.end(); ++it2) {
            if (it2->objectsList.size() == maxNumberOfObjectsPerGroup)
                finalGroupList.push_back(*it2);
        }
        // Since tmpGroupList is a list we can use remove_if to remove them
        tmpGroupList.remove_if(rmCondition);
    }
}
if (maxNumberOfObjectsPerGroup == 0) {
    finalGroupList = vector<ObjectsGroup>(tmpGroupList.begin(), tmpGroupList.end());
} else {
    list<ObjectsGroup>::iterator it6;
    for (it6 = tmpGroupList.begin(); it6 != tmpGroupList.end(); ++it6)
        finalGroupList.push_back(*it6);
}
Where tmpGroupList is a list<ObjectsGroup>, finalGroupList is a vector<ObjectsGroup>, and rmCondition is a boolean predicate that returns true once the size of an ObjectsGroup has reached a fixed value. MyObject and ObjectsGroup are two simple data structures, written in the following way:
// Data structure of the single object
class MyObject {
public:
    // A default constructor is needed, since the reading loop
    // default-constructs a MyObject before filling it in
    MyObject() {}
    MyObject(unsigned short int spaceToReserve,
             double defaultContent,
             const string &lab) {
        attributes = vector<double>(spaceToReserve, defaultContent);
        label = lab;
    }
    vector<double> attributes;
    string label;
};
// Data structure of a group of objects
class ObjectsGroup {
public:
list<MyObject> objectsList;
double health;
};
This code seems to work, but it is really slow. Since, as I said before, I have to apply it to a large set of objects, is there a way to improve it and make it faster? Thanks.
[EDIT] What I'm trying to achieve is to make groups of objects where each object is a vector<double> (got from a CSV file). So what I'm asking here is, is there a more efficient way to collect those kind of objects in groups than what is exposed in the code example above?
[EDIT2] I need to make groups using all of those vectors.
So, I'm reading your question...
... I would like to collect them in groups according to a certain
custom criterion (e.g. same values at the n-th and (n+1)-th positions
for all objects of that group) ...
Ok, I read this part, and kept on reading...
... And I need to do that in the most efficient way, since the text file
contains hundreds of thousands of objects...
I'm still with you, makes perfect sense.
... To do so, firstly I load all the CSV lines ...
{thud} {crash} {loud explosive noises}
Ok, I stopped reading right there, and didn't pay much attention to the rest of the question, including the large code sample. This is because we have a basic problem right from the start:
1) You say that your intention is, typically, to read only a small
portion of this huge CSV file, and...
2) ... to do that you load the entire CSV file, into a fairly sophisticated data structure.
These two statements are at odds with each other. You're reading a huge number of values from a file. You are creating an object for each value. Based on the premise of your question, you're going to have a large number of these objects. But then, when all is said and done, you're only going to look at a small number of them, and throw the rest away?
You are doing a lot of work, presumably using up a lot of memory, and CPU cycles, loading a huge data set, only to ignore most of it. And you are wondering why you're having performance issues? Seems pretty cut and dry to me.
What would be an alternative way of doing this? Well, let's turn this whole problem inside out, and approach it piecemeal. Let's read a CSV file, one line at a time, parse the values in the CSV-formatted file, and pass the resulting strings to a lambda.
Something like this:
template<typename Callback> void parse_csv_lines(std::ifstream &i,
                                                 Callback &&callback)
{
    std::string line;

    while (1)
    {
        line.clear();
        std::getline(i, line);

        // Deal with missing newline on the last line...
        if (i.eof() && line.empty())
            break;

        std::vector<std::string> words;

        // At this point, you'll take this "line", and split it apart, at
        // the commas, into the individual words. Parsing a CSV-formatted
        // file. Not very exciting, you're doing this already, the
        // algorithm is boring to implement, you know how to do it, so
        // let's say you replace this entire comment with your
        // boiler-plate CSV parsing logic from your existing code.

        callback(words);
    }
}
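For completeness, the splitting itself can be sketched with std::getline on a string stream. split_csv_line is a hypothetical helper name, and this minimal version deliberately ignores quoted fields and escaped commas, which a real CSV file may need:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Minimal comma splitter: assumes plain [value],...,[label] lines
// with no quoting. Each field between commas becomes one word.
std::vector<std::string> split_csv_line(const std::string &line)
{
    std::vector<std::string> words;
    std::istringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ','))
        words.push_back(field);
    return words;
}
```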
Ok, now we've done the task of parsing the CSV file. Now, let's say we want to do the task you've set out at the beginning of your question: grab the nth and (n+1)th values. So...
void do_something_with_n_and_nplus1_words(size_t n)
{
    std::ifstream input_file("input_file.csv");

    // Insert code to check if input_file.is_open(), and if not, do
    // whatever

    parse_csv_lines(input_file,
                    [n]
                    (const auto &words)
                    {
                        // So now, grab words[n] and words[n+1]
                        // (after checking, of course, for a malformed
                        // CSV file with fewer than "n+2" values)
                        // and do whatever you want with them.
                    });
}
That's it. Now, you end up simply reading the CSV file, and doing the absolute minimum amount of work required to extract the nth and the n+1th values from each CSV file. It's going to be fairly difficult to come up with an approach that does less work (except, of course, micro-optimizations related to CSV parsing, and word buffers; or perhaps foregoing the overhead of std::ifstream, but rather mmap-ing the entire file, and then parsing it out by scanning its mmap-ed contents, something like that), I'd think.
For other similar one-off tasks, requiring only a small number of values from the CSV files, just write an appropriate lambda to fetch them out.
Perhaps you need to retrieve two or more subsets of values from the large CSV file, and you want to read the CSV file only once? Well, it's hard to give one best general approach. Each of these situations will require individual analysis to pick the best approach.
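If, on the other hand, you really do need to group all of the objects (as your EDIT2 suggests), and the grouping criterion can be expressed as "has the same key", then a single pass with an unordered_map avoids scanning the whole list of existing groups for every object. A sketch, with a hypothetical Row type and a key built from the n-th and (n+1)-th values (this assumes the criterion really is exact equality of those values, as the question states):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row type: a vector of doubles plus its label.
struct Row { std::vector<double> attributes; std::string label; };

// Group rows whose n-th and (n+1)-th attributes are identical.
// One hash-map lookup per row: O(N) on average instead of O(N * groups).
std::unordered_map<std::string, std::vector<Row>>
group_by_key(const std::vector<Row> &rows, std::size_t n)
{
    std::unordered_map<std::string, std::vector<Row>> groups;
    for (const auto &r : rows) {
        // Key built from the two attributes; any key-expressible
        // criterion works the same way. Exact double comparison is
        // assumed, matching the "same values" wording of the question.
        std::string key = std::to_string(r.attributes[n]) + "," +
                          std::to_string(r.attributes[n + 1]);
        groups[key].push_back(r);
    }
    return groups;
}
```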
Related
I am new to interfaces with databases through c++ and was wondering what is the best approach to do the following:
I have an object with member variables that I define ahead of time, and member variables that I need to pull from a database given the known variables. For example:
class DataObject
{
public:
    int input1;
    string input2;
    double output1;
    DataObject(int Input1, string Input2) :
        input1(Input1), input2(Input2)
    {
        output1 = Initializer(input1, input2);
    }
private:
    double Initializer(int, string);
    static RecordSet rs; // I am just guessing the object would be called RecordSet
};
Now, I can do something like:
std::vector<DataObject> v;
for (int n = 0; n <= 10; ++n)
    for (char w = 'a'; w <= 'z'; ++w)
        v.push_back(DataObject{n, w});
And get an initialized vector of DataObjects. Behind the scenes, Initializer will check if rs already has data. If not, it will connect to the database and run a query something like: select input1, input2, output1 from ... where input1 between 1 and 10 and input2 between 'a' and 'z', and then start initializing each DataObject with output1 for each pair of input1 and input2.
This would be utterly simple in C#, but from code samples I have found online it looks utterly ugly in C++. I am stuck on two things. As stated earlier, I am completely new to database interfaces in C++, and there are so many methods to choose from, but I would like to home in on a specific method that truly fits my purpose. Furthermore - and this is the purpose - I am trying to make use of a static data set to pull data in a single query, rather than run a new query for each input1/input2 combination. Better yet, is there a way to have database results written directly into the newly created DataObjects rather than making a pit stop in some temporary RecordSet object?
To summarize and clarify: I have data in a relational database, and I am trying to pull it out and store it in a collection of objects. How do I do this? Any tips/direction - I am much obliged.
EDIT 8/16/17: After some research and trials I have come up with the below
So I've had progress by using an ADORecordset with the put_CursorLocation set to adUseServer:
rs->put_CursorLocation(adUseServer);
My understanding is that by using this setting the query result is stored on the server, and the client side only gets the current row pointed to by rs.
So I get my data from the row and create the DataObject on the spot, emplace_back it into the vector, and finally call rs->MoveNext() to get the next row and repeat until I reach the end. Partial example as follows:
std::vector<DataObject> v;
DataObject::rs.Open(connString, Sql); // Connection for wrapper class
for (int n = 0; n <= 10; ++n)
    for (char w = 'a'; w <= 'z'; ++w)
        v.emplace_back(DataObject{n, w});
// Somewhere else...
double DataObject::Initializer(int a, string b) {
    int ra; string rb; double rc = 0.0;
    // For simplicity's sake, let's assume the result set is ordered
    // in the same way as the for-loop, and that no data is missing,
    // so the sanity check below would be unnecessary (but it is included).
    while (!rs.IsEOF())
    {
        // Let's assume I defined these 'Get' functions
        ra = rs.Get<int>("Input1");
        rb = rs.Get<string>("Input2");
        rc = rs.Get<double>("Output1");
        rs.MoveNext();
        if (ra == a && rb == b) break;
    }
    return rc;
}
// Constructor for RecordSet:
RecordSet::RecordSet()
{
    HRESULT hr = rs_.CoCreateInstance(CLSID_CADORecordset);
    ATLENSURE_SUCCEEDED(hr);
    rs_->put_CursorLocation(adUseServer);
}
Now I'm hoping that I interpreted how this works correctly; otherwise, this would be a whole lot of fuss over nothing. I am not an ADO or .NET expert - clearly - but I'm hoping someone can chime in to confirm that this is indeed how it works, and perhaps shed some more light on the topic. On my end, I tested the memory usage using VS2015's diagnostic tool, and the heap seems to be significantly larger when using adUseClient. If my conjecture is correct, then why would anyone opt to use adUseClient, or any of the other choices, over adUseServer?
I can think of two options: mapping by member type, or a BLOB.
For classes, I recommend one row per class instance, with one column per member. Check which data types your database supports; there are some common ones.
Another method is to use the BLOB (Binary Large OBject) data type. This is a "binary" data type used for storing data-as-is.
You can use the BLOB type for members that are of unsupported data types.
You can get more complicated by researching "Database Normalization" or "Database normal forms".
I have 2 structs, one simply has 2 values:
struct combo {
    int output;
    int input;
};
And another, a comparator that orders the combo elements by their input member:
struct organize {
    bool operator()(combo const &a, combo const &b)
    {
        return a.input < b.input;
    }
};
Using this:
sort(myVector.begin(), myVector.end(), organize());
What I'm trying to do with this is iterate through the input variable and check whether each element is equal to another input 'in'. If an element is equal, I want to take the value from output at the same index where the match was found in input, and push it into another temp vector.
I originally went with a simpler solution (when I wasn't using structs and simply had two vectors, one input and one output) and had this in a function called copy:
for (size_t i = 0; i < input.size(); ++i) {
    if (input[i] == in) {
        temp.push_back(output[i]);
    }
}
Now this code did work exactly how I needed it, the only issue is it is simply too slow. It can handle 10 integer inputs, or 100 inputs but around 1000 it begins to slow down taking an extra 5 seconds or so, then at 10,000 it takes minutes, and you can forget about 100,000 or 1,000,000+ inputs.
So, I asked how to speed it up on here (just the function iterator) and somebody suggested sorting the input vector which I did, implemented their suggestion of using upper/lower bound, changing my iterator to this:
auto lowerIt = std::lower_bound(input.begin(), input.end(), in);
auto upperIt = std::upper_bound(input.begin(), input.end(), in);
for (auto it = lowerIt; it != upperIt; ++it)
{
    temp.push_back(output[it - input.begin()]);
}
And it worked, it made it much faster, I still would like it to be able to handle 1,000,000+ inputs in seconds but I'm not sure how to do that yet.
I then realized that I can't just sort the input vector on its own. What if the inputs are something like:
input.push_back(10);
input.push_back(-1);
output.push_back(1);
output.push_back(2);
Well then we have 10 in input corresponding to 1 in output, and -1 corresponding to 2. Obviously 10 doesn't come before -1 so sorting it smallest to largest doesn't really work here.
So I found a way to sort the input based on the output. So no matter how you organize input, the indexes match each other based on what order they were added.
My issue is, I have no clue how to iterate through just input with the same upper/lower bound iterator above. I can't seem to call upon just the input variable of myVector, I've tried something like:
std::vector<combo>::iterator it = myVector.input.begin();
But I get an error saying there is no member 'input'.
How can I iterate through just input so I can apply the upper/lower bound iterator to this new way with the structs?
Also I explained everything so everyone could get the best idea of what I have and what I'm trying to do, also maybe somebody could point me in a completely different direction that is fast enough to handle those millions of inputs. Keep in mind I'd prefer to stick with vectors because not doing so would involve me changing 2 other files to work with things that aren't vectors or lists.
Thank you!
I think that if you sort it smallest to largest (the inputs are integers, after all) you should be able to use std::adjacent_find to find duplicates in the array and process them properly. For the performance issues, you might consider using reserve to preallocate space for your large vector, so that your push_back operations don't have to reallocate memory as often.
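Since the question is specifically how to apply the bound search to the vector of structs: once myVector is sorted with organize(), std::equal_range with the same comparator gives both bounds in one call, and the matching outputs can be collected directly from the structs. A sketch (outputs_for is a hypothetical helper name):

```cpp
#include <algorithm>
#include <vector>

struct combo {
    int output;
    int input;
};

// After sorting the vector of structs by .input, equal_range with the
// same comparator finds the run of elements whose input == in, in
// O(log n). The matching outputs are read straight from the structs,
// so the index correspondence between input and output is preserved.
std::vector<int> outputs_for(const std::vector<combo> &v, int in)
{
    combo probe{0, in}; // only .input matters for the comparison
    auto cmp = [](combo const &a, combo const &b) { return a.input < b.input; };
    auto range = std::equal_range(v.begin(), v.end(), probe, cmp);
    std::vector<int> temp;
    for (auto it = range.first; it != range.second; ++it)
        temp.push_back(it->output);
    return temp;
}
```

The precondition is the same as for lower_bound/upper_bound: v must already be sorted with the same comparator used here (the sort(myVector.begin(), myVector.end(), organize()) call from the question).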
I have a binary format which is built up like this:
magic number
name size blob
name size blob
name size blob
...
It is built this way to make it easy to move through the file and find the right entry. But I would also like to remove an entry (let's call it a chunk, as it is one). I guess I can use std::copy/memmove with some iostream iterators to move the chunks behind the one to delete, copying them over the chunk to delete. But then the space I freed is left at the end, filled with unusable data (I could fill it up with zeros or not). I would likely shrink the file afterwards.
I know I can read the whole data that I want to keep in a buffer and put it into a new file, but I dislike it to rewrite the whole file for deleting just one chunk.
Any ideas for the best way of removing data in a file?
@MarkSetchell had a good idea for how to treat that problem:
I now have a magic number at the beginning of every chunk to check whether another valid chunk is coming. After moving some data towards the beginning, I move the writer pointer right behind the last chunk and fill the space for the next magic number with zeros. So when listing the entries it will stop when there is no valid magic number, and if I add another entry it will automatically overwrite the unused space.
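The scheme described above can be sketched in-memory. The 1-byte magic and 1-byte size field here are purely illustrative (a real format would use the full-width fields, and a file-based version would do the same moves with seekg/seekp and then optionally truncate):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// In-memory sketch. Hypothetical chunk layout: [magic][size][payload...].
const std::uint8_t MAGIC = 0xAB;

// Count chunks by walking magic numbers until an invalid one appears,
// which is exactly how the zeroed terminator stops the listing.
std::size_t count_chunks(const std::vector<std::uint8_t> &file)
{
    std::size_t n = 0, off = 0;
    while (off + 2 <= file.size() && file[off] == MAGIC) {
        ++n;
        off += 2 + file[off + 1];
    }
    return n;
}

// Delete the chunk starting at chunkOffset: shift the following chunks
// toward the beginning, then zero the freed tail so no stale magic
// number can be mistaken for a valid chunk.
void remove_chunk(std::vector<std::uint8_t> &file, std::size_t chunkOffset)
{
    std::size_t size = file[chunkOffset + 1];
    std::size_t next = chunkOffset + 2 + size; // start of the next chunk
    std::copy(file.begin() + next, file.end(), file.begin() + chunkOffset);
    std::fill(file.end() - (next - chunkOffset), file.end(), 0);
}
```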
I know I can read the whole data that I want to keep in a buffer and put it into a new file, but I dislike it to rewrite the whole file for deleting just one chunk.
Any ideas for the best way of removing data in a file?
You can't have the best of both worlds. If you want to preserve space, you will need something to describe the file sections (let's call it an allocation table), with each file section consisting of a sequence of shards.
A section would start off normally (one shard), but as soon as it is de-allocated, the de-allocated section will be made available as part of a shard for a new section. One can now choose at what point in time you are willing to live with sharded (non-contiguous) sections (perhaps only after your file reaches a certain size limit).
The allocation table describes each section as a series (linked list) of shards (or one shard, if contiguous). One could either reserve a fixed size for the allocation table, or keep it in a different file, or shard it and give it the ability to reconstruct itself.
struct Section
{
    struct Shard
    {
        std::size_t baseAddr_;
        std::size_t size_;
    };
    std::string name_;
    std::size_t shardCount_;
    std::vector<Shard> shards_;
    std::istream& readFrom( std::istream& );
};

struct AllocTable
{
    std::size_t sectionCount_;
    std::vector<Section> sections_;
    std::size_t next_;
    std::istream& readFrom( std::istream& is, AllocTable* previous )
    {
        // Brief code... error handling left as your exercise
        is >> sectionCount_;
        sections_.resize( sectionCount_ );
        for( std::size_t i = 0; i < sectionCount_; ++i )
        {
            sections_[i].readFrom( is );
        }
        is >> next_; // Note - no error handling for brevity
        if( next_ != static_cast<std::size_t>(-1) )
        {
            is.seekg( next_ ); // Seek to next_ from file beginning
            AllocTable nextTable;
            nextTable.readFrom( is, this );
            sections_.insert( sections_.end(),
                nextTable.sections_.begin(), nextTable.sections_.end() );
        }
        return is;
    }
};
...
I need to implement a LRU algorithm in a 3D renderer for texture caching. I write the code in C++ on Linux.
In my case I will use texture caching to store "tiles" of image data (16x16 pixel blocks). Now imagine that I do a lookup in the cache and get a hit (the tile is in the cache). How do I return the content of the cache for that entry to the function caller? Let me explain. I imagine that when I load a tile into the cache memory, I allocate the memory to store 16x16 pixels, for example, then load the image data for that tile. Now there are two solutions to pass the content of the cache entry to the function caller:
1) either as pointer to the tile data (fast, memory efficient),
TileData *tileData = cache->lookup(tileId); // not safe?
2) or I need to recopy the tile data from the cache within a memory space allocated by the function caller (copy can be slow).
void Cache::lookup(int tileId, float *&tileData)
{
    // find tile in cache; if not in cache, load from disk and add to cache, ...
    ...
    // now copy the tile data: safe, but isn't that slow?
    memcpy((char*)tileData, tileDataFromCache, sizeof(float) * 3 * 16 * 16);
}
float *tileData = new float[3 * 16 * 16]; // need to allocate the memory for that tile
// get tile data from cache, requires a copy
cache->lookup(tileId, tileData);
I would go with 1) but the problem is, what happens if the tile gets deleted from the cache just after the lookup, and the function tries to access the data using the returned pointer? The only solution I see to this is to use a form of reference counting (e.g. shared_ptr) where the data is only actually deleted when it's no longer used.
The application might access more than one texture, and I can't seem to find a way of creating a key which is unique to each texture and each tile of a texture. For example I may have tile 1 from file1 and tile 1 from file2 in the cache, so searching on tileId=1 is not enough... but I can't seem to find a way of creating a key that accounts for both the file name and the tileId. I can build a string that contains the file name and the tileId (FILENAME_TILEID), but wouldn't a string used as a key be much slower than an integer?
Finally I have a question regarding time stamps. Many papers suggest using a time stamp for ordering the entries in the cache. What is a good function for getting a time stamp? time(), clock()? Is there a better way than using time stamps?
Sorry, I realise it's a very long message, but LRU doesn't seem as simple to implement as it sounds.
Answers to your questions:
1) Return a shared_ptr (or something logically equivalent to it). Then all of the "when-is-it-safe-to-delete-this-object" issues pretty much go away.
2) I'd start by using a string as a key, and see if it actually is too slow or not. If the strings aren't too long (e.g. your filenames aren't too long) then you may find it's faster than you expect. If you do find out that string keys aren't efficient enough, you could try something like computing a hashcode for the string and adding the tile ID to it... that would probably work in practice, although there would always be the possibility of a hash collision. But you could have a collision-check routine run at startup that generates all of the possible filename+tileID combinations and alerts you if any two map to the same key value, so that at least you'd know immediately during your testing when there is a problem and could do something about it (e.g. by adjusting your filenames and/or your hashcode algorithm). This assumes that all the filenames and tile IDs are known in advance, of course.
3) I wouldn't recommend using a timestamp, it's unnecessary and fragile. Instead, try something like this (pseudocode):
typedef shared_ptr<TileData> TileDataPtr; // automatic memory management!
linked_list<TileDataPtr> linkedList;
hash_map<data_key_t, TileDataPtr> hashMap;
// This is the method the calling code would call to get its tile data for a given key
TileDataPtr GetData(data_key_t theKey)
{
    if (hashMap.contains_key(theKey))
    {
        // The desired data is already in the cache, great! Just move it
        // to the head of the LRU list (to reflect its popularity) and
        // then return it.
        TileDataPtr ret = hashMap.get(theKey);
        linkedList.remove(ret);     // move this item to the head of the
        linkedList.push_front(ret); // linked list (note: remove() scans
                                    // the list; storing list iterators in
                                    // the map would make this O(1))
        return ret;
    }
    else
    {
        // Oops, the requested object was not in our cache, load it from
        // disk or whatever
        TileDataPtr ret = LoadDataFromDisk(theKey);
        linkedList.push_front(ret);
        hashMap.put(theKey, ret);

        // Don't let our cache get too large -- delete
        // the least-recently-used item if necessary
        if (linkedList.size() > MAX_LRU_CACHE_SIZE)
        {
            TileDataPtr dropMe = linkedList.tail();
            hashMap.remove(dropMe->GetKey());
            linkedList.remove(dropMe);
        }
        return ret;
    }
}
In the same order as your questions:
Copying over the texture data does not seem reasonable from a performance standpoint. Reference counting sounds far better, as long as you can actually code it safely. The data memory would be freed as soon as it is neither used by the renderer nor referenced by the cache.
I assume that you are going to use some sort of hash table for the look-up part of what you are describing. The common solution to your problem has two parts:
Using a suitable hashing function that combines multiple values, e.g. the texture file name and the tile ID. Essentially you create a composite key that is treated as one entity. The hashing function could be an XOR of the hashes of all elementary components, or something more complex.
Selecting a suitable hash function is critical for performance reasons - if the said function is not random enough, you will have a lot of hash collisions.
Using a suitable composite equality check to handle the case of hash collisions.
This way you can look-up the combination of all attributes of interest in a single hash table look-up.
Using timestamps for this is not going to work - period. Most sources regarding caching usually describe the algorithms in question with network resource caching in mind (e.g. HTTP caches). That is not going to work here for three reasons:
Using natural time only makes sense if you intend to implement caching policies that take it into account, e.g. dropping a cache entry after 10 minutes. Unless you are doing something very weird, something like this makes no sense within a 3D renderer.
Timestamps have a relatively low actual resolution, even if you use high precision timers. Most timer sources have a precision of about 1ms, which is a very long time for a processor - in that time your renderer would have worked through several texture entries.
Do you have any idea how expensive timer calls are? Abusing them like this could even make your system perform worse than not having any cache at all...
The usual solution to this problem is to not use a timer at all. The LRU algorithm only needs to know two things:
The maximum number of entries allowed.
The order of the existing entries w.r.t. their last access.
Item (1) comes from the configuration of the system and typically depends on the available storage space. Item (2) generally implies the use of a combined linked list/hash table data structure, where the hash table part provides fast access and the linked list retains the access order. Each time an entry is accessed, it is placed at the end of the list, while old entries are removed from its start.
Using a combined data structure, rather than two separate ones, allows entries to be removed from the hash table without having to go through a look-up operation. This improves the overall performance, but it is not absolutely necessary.
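A minimal sketch of that combined structure, storing each entry's list iterator in the map so both the move-to-front and the eviction avoid scanning the list (names are illustrative):

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// LRU cache combining a doubly-linked list (recency order, front =
// most recent) with a hash map from key to the entry's list iterator,
// so no operation needs to scan the list.
template <typename Key, typename Value>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    void put(const Key &key, const Value &value)
    {
        auto it = map_.find(key);
        if (it != map_.end())
            order_.erase(it->second);        // O(1) removal via iterator
        order_.push_front({key, value});
        map_[key] = order_.begin();
        if (order_.size() > capacity_) {     // evict least-recently-used
            map_.erase(order_.back().first);
            order_.pop_back();
        }
    }

    // Returns nullptr on a miss; on a hit, moves the entry to the front.
    const Value *get(const Key &key)
    {
        auto it = map_.find(key);
        if (it == map_.end())
            return nullptr;
        // splice moves the node without invalidating its iterator
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }

private:
    std::list<std::pair<Key, Value>> order_;
    std::unordered_map<Key,
        typename std::list<std::pair<Key, Value>>::iterator> map_;
    std::size_t capacity_;
};
```

The key point is std::list::splice: it relinks the node in place, so the iterator stored in the map stays valid and the move-to-front costs O(1).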
As promised I am posting my code. Please let me know if I have made mistakes or if I could improve it further. I am now going to look into making it work in a multi-threaded environment. Again thanks to Jeremy and Thkala for their help (sorry the code doesn't fit the comment block).
#include <cstdlib>
#include <cstdio>
#include <memory>
#include <list>
#include <unordered_map>
#include <cstdint>
#include <iostream>

typedef uint32_t data_key_t;

class TileData
{
public:
    TileData(const data_key_t &key) : theKey(key) {}
    data_key_t theKey;
    ~TileData() { std::cerr << "delete " << theKey << std::endl; }
};

typedef std::shared_ptr<TileData> TileDataPtr; // automatic memory management!

TileDataPtr loadDataFromDisk(const data_key_t &theKey)
{
    return std::shared_ptr<TileData>(new TileData(theKey));
}

class CacheLRU
{
public:
    // the linked list keeps track of the order in which the data was accessed
    std::list<TileDataPtr> linkedList;
    // the hash map (unordered_map is part of C++0x while hash_map isn't?) gives quick access to the data
    std::unordered_map<data_key_t, TileDataPtr> hashMap;
    CacheLRU() : cacheMiss(0), cacheHit(0) {}
    TileDataPtr getData(data_key_t theKey)
    {
        std::unordered_map<data_key_t, TileDataPtr>::const_iterator iter = hashMap.find(theKey);
        if (iter != hashMap.end()) {
            TileDataPtr ret = iter->second;
            linkedList.remove(ret);
            linkedList.push_front(ret);
            ++cacheHit;
            return ret;
        }
        else {
            ++cacheMiss;
            TileDataPtr ret = loadDataFromDisk(theKey);
            linkedList.push_front(ret);
            // note: explicit template arguments on make_pair don't
            // compile in C++11, so let them be deduced
            hashMap.insert(std::make_pair(theKey, ret));
            if (linkedList.size() > MAX_LRU_CACHE_SIZE) {
                const TileDataPtr dropMe = linkedList.back();
                hashMap.erase(dropMe->theKey);
                linkedList.remove(dropMe);
            }
            return ret;
        }
    }
    static const uint32_t MAX_LRU_CACHE_SIZE = 8;
    uint32_t cacheMiss, cacheHit;
};

int main(int argc, char **argv)
{
    CacheLRU cache;
    for (uint32_t i = 0; i < 238; ++i) {
        int key = random() % 32;
        TileDataPtr tileDataPtr = cache.getData(key);
    }
    std::cerr << "Cache hit: " << cache.cacheHit << ", cache miss: " << cache.cacheMiss << std::endl;
    return 0;
}
I am writing a function for getting datasets from a file and putting them into vectors. The datasets are then used in a calculation. In the file, a user writes each dataset on a line under a heading like 'Dataset1'. The result is i vectors by the time the function finishes executing. The function works just fine.
The problem is that I don't know how to get the vectors out of the function! (1) I think I can only return one entity from a function. So I can't return i vectors. Also, (2) I can't write the vectors/datasets as function parameters and return them by reference because the number of vectors/datasets is different for each calculation. If there are other possibilities, I am unaware of them.
I'm sure this is a silly question, but am I missing something here? I would be very grateful for any suggestions. Until now, I have not put the vector/dataset extraction code into a function; I have kept it in my main file, where it has worked fine. I would now like to clean up my code by putting all data extraction code into its own function.
For each calculation, I DO know the number of vectors/datasets that the function will find in the file because I have that information written in the file and can extract it. Is there some way I could use this information?
If each vector is of the same type you can return a
std::vector<std::vector<datatype> >
This would look like:
std::vector<std::vector<datatype> > function(arguments) {
    std::vector<std::vector<datatype> > return_vector;
    for (int i = 0; i < rows; ++i) {
        // do processing
        return_vector.push_back(resulting_vector);
    }
    return return_vector;
}
As has been mentioned, you may simply use a vector of vectors.
In addition, you may want to add a smart pointer around it, just to make sure you're not copying the contents of your vectors (but that's already an improvement. First aim at something that works).
As for the information on the number of vectors, you may use it by resizing the global vector to the appropriate value.
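For example (make_container is a hypothetical helper; the count would come from the header you mentioned in the file):

```cpp
#include <cstddef>
#include <vector>

// If the file says up front how many datasets follow, reserving that
// capacity avoids repeated reallocation as the outer vector grows.
std::vector<std::vector<double>> make_container(std::size_t datasetCount)
{
    std::vector<std::vector<double>> all;
    all.reserve(datasetCount);
    return all;
}
```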
Your question is, at its essence, "How do I return a pile of things from a function?" It happens that your things are vector<double>, but that's not really important. What is important is that you have a pile of them of unknown size.
You can refine your thinking by rephrasing your one question into two:
How do I represent a pile of things?
How do I return that representation from a function?
As to the first question, this is precisely what containers do. Containers, as you surely know because you are already using one, hold an arbitrary number of similar objects. Examples include std::vector<T> and std::list<T>, among others. Your choice of which container to use is dictated by circumstances you haven't mentioned -- for example, how expensive are the items to copy, do you need to delete an item from the middle of the pile, etc.
In your specific case, knowing what little we know, it seems you should use std::vector<>. As you know, the template parameter is the type of the thing you want to store. In your case that happens to be (coincidentally), an std::vector<double>. (The fact that the container and its contained object happen to be similar types is of no consequence. If you need a pile of Blobs or Widgets, you say std::vector<Blob> or std::vector<Widget>. Since you need a pile of vector<double>s, you say vector<vector<double> >.) So you would declare it thus:
std::vector<std::vector<double > > myPile;
(Notice the space between > and >. That space is required in the previous C++ standard.)
You build up that vector just as you did your vector<double> -- either using generic algorithms, or invoking push_back, or some other way. So, your code would look like this:
void function( /* args */ ) {
    std::vector<std::vector<double> > myPile;
    while ( /* some condition */ ) {
        std::vector<double> oneLineOfData;
        /* code to read in one vector */
        myPile.push_back(oneLineOfData);
    }
}
In this manner, you collect all of the incoming data into one structure, myPile.
As to the second question, how to return the data. Well, that's simple -- use a return statement.
std::vector<std::vector<double> > function( /* args */ ) {
    std::vector<std::vector<double> > myPile;
    /* All of the useful code goes here */
    return myPile;
}
Of course, you could also return the information via a passed-in reference to your vector:
void function( /* args */, std::vector<std::vector<double> >& myPile)
{
    /* code goes here, including: */
    myPile.push_back(oneLineOfData);
}
Or via a passed-in pointer to your vector:
void function( /* args */, std::vector<std::vector<double> >* myPile)
{
    /* code goes here, including: */
    myPile->push_back(oneLineOfData);
}
In both of those cases, the caller must create the vector-of-vector-of-double before invoking your function. Prefer the first (return) way, but if your program design dictates, you can use the other ways.