How to generate a hashmap for a huge chunk of data? - C++

I want to make a map such that a set of pointers point to arrays of dynamic size.
I did use hashing with chaining. But since the data set is huge, the program gives std::bad_alloc after a few iterations. The likely reason is the new used to allocate each node of the linked list.
Could someone suggest which data structure I should use?
Or anything else that could improve memory usage with my hash table?
The program is in C++.
This is what my code looks like:
Initialization of hashtable:
class Link
{
public:
    double iData;
    Link* pNext;
    Link(double it) : iData(it), pNext(NULL)   // initialize pNext so a one-element chain is terminated
    { }
    void displayLink()
    { cout << iData << " "; }
};

class List
{
private:
    Link* pFirst;
public:
    List()
    { pFirst = NULL; }
    void insert(double key)
    {
        if(pFirst==NULL)
            pFirst = new Link(key);
        else
        {
            Link* pLink = new Link(key);
            pLink->pNext = pFirst;
            pFirst = pLink;
        }
    }
};

class HashTable
{
public:
    int arraySize;
    vector<List*> hashArray;
    HashTable(int size)
    {
        hashArray.resize(size);
        for(int j=0; j<size; j++)
            hashArray[j] = new List;
    }
};
main snippet:
int t_sample = 1000;
for(int i=0; i < k; i++) // initialize random positions
{
    x[i] = (cal_rand() * dom_sizex); //dom_sizex = 20e-10, cal_rand() generates a random number between 0 and 1
    y[i] = (cal_rand() * dom_sizey); //dom_sizey = 10e-10
}
for(int t=0; t < t_sample; t++)
{
    int size;
    size = cell_nox * cell_noy; //size of hash table, cell_nox = 212, cell_noy = 424
    HashTable theHashTable(size); //make table
    int hashValue = 0;
    for(int n=0; n<k; n++) // k = 10*212*424
    {
        int m = x[n] / cell_width; //cell_width = 4.7e-8
        int l = y[n] / cell_width;
        hashValue = (kx*l)+m;
        theHashTable.hashArray[hashValue]->insert(n);
    }
    -------
    -------
}

First things first, use a Standard Container. In your specific case, you might want:
either std::unordered_multimap<int, double>
or std::unordered_map<int, std::vector<double>>
(Note: if you do not have C++11, those are available in Boost)
Your main loop becomes (using the second option):
typedef std::unordered_map<int, std::vector<double>> HashTable;

for(int t = 0; t < t_sample; ++t)
{
    size_t const size = cell_nox * cell_noy;
    // size of hash table, cell_nox = 212, cell_noy = 424
    HashTable theHashTable;
    theHashTable.reserve(size);
    for (int n = 0; n < k; ++n) // k = 10*212*424
    {
        int m = x[n] / cell_width; //cell_width = 4.7e-8
        int l = y[n] / cell_width;
        int const cellId = (kx*l)+m;
        theHashTable[cellId].push_back(n);
    }
}
This will reliably not leak memory (although of course you might have other leaks elsewhere), and thus will give you a reliable baseline. It is also probably faster than your approach, with a more convenient interface, etc.
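For completeness, a minimal lookup sketch (an addition for illustration, assuming the same l, m, kx and theHashTable as in the loop above): once the table is built, reading back the contents of one cell needs no hand-rolled linked list:

int const cellId = (kx * l) + m;          // same hash as in the build loop
auto it = theHashTable.find(cellId);
if (it != theHashTable.end())
{
    for (double idx : it->second)         // the particle indices stored for that cell
        std::cout << idx << " ";
}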
In general you should not re-invent the wheel, unless you have a specific need that is not addressed by the available wheels or you are actually trying to learn how to create a wheel or to create a better wheel.

The OS has to solve the same issues with memory pages, so maybe it's worth looking at how that is done? First of all, let's assume all pages are on the disk. A page is a fixed-size memory chunk. For your use case, let's say it's an array of your records. Because RAM is limited, the OS maintains a mapping between the page number and its location in RAM.
So, let's say your pages have 1000 records, and you want to access record 2024, you would ask the OS for page 2, and read record 24 from that page. That way, your map is only 1/1000 in size.
Now, if your page has no mapping to a memory location, then it is either on disk or has never been accessed before (is empty). Then you need to swap out another page, and load that page from disk (and update the location mapping).
This is a very simplified description of what happens, and I wouldn't be surprised if someone jumps down my throat for describing it like this.
The point is: what does this mean for you?
First of all, your data exceeds your RAM - you won't get around writing to disk, unless you want to try compression first.
Second, your chains can work as pages if you want, but I wonder whether just paging your hash codes would work better. What I mean is: use the upper bits as the page number, and the lower bits as the offset within the page. Avoiding collisions is still key, as you want to load as few pages as possible. You can still chain your pages, and end up with a much smaller map.
Third - a crucial part is deciding which pages to swap out to make room for new pages. LRU should do OK. If you can predict which pages you will (not) need, so much the better.
Fourth - you need placeholders for your pages to tell you whether they are in memory or on disk.
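To make the bit-splitting and LRU ideas concrete, here is a minimal sketch (my own illustration, not code from the question): the hash value is split into a page number (upper bits) and an offset (lower bits), each page is a fixed-size array of records, and a least-recently-used policy evicts a resident page when a new one has to be loaded. The PAGE_BITS constant and the commented-out disk I/O hooks are assumptions for illustration only.

#include <array>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

constexpr std::size_t PAGE_BITS = 10;                 // lower bits = offset in page
constexpr std::size_t PAGE_SIZE = 1u << PAGE_BITS;    // records per page

struct Record { double value = 0.0; };
using Page = std::array<Record, PAGE_SIZE>;

class PagedStore
{
public:
    explicit PagedStore(std::size_t maxResidentPages) : maxPages(maxResidentPages) {}

    Record& at(std::size_t hashValue)
    {
        std::size_t pageNo = hashValue >> PAGE_BITS;      // upper bits select the page
        std::size_t offset = hashValue & (PAGE_SIZE - 1); // lower bits select the record
        return loadPage(pageNo)[offset];
    }

private:
    Page& loadPage(std::size_t pageNo)
    {
        auto it = resident.find(pageNo);
        if (it != resident.end())
        {
            lru.splice(lru.begin(), lru, it->second.second); // mark page as most recently used
            return it->second.first;
        }
        if (resident.size() >= maxPages)                     // evict the least recently used page
        {
            std::size_t victim = lru.back();
            lru.pop_back();
            // writePageToDisk(victim, resident[victim].first); // assumed I/O hook
            resident.erase(victim);
        }
        lru.push_front(pageNo);
        auto& slot = resident[pageNo];
        slot.second = lru.begin();
        // readPageFromDisk(pageNo, slot.first);               // assumed I/O hook
        return slot.first;
    }

    std::size_t maxPages;
    std::list<std::size_t> lru;   // front = most recently used page number
    std::unordered_map<std::size_t, std::pair<Page, std::list<std::size_t>::iterator>> resident;
};

The fewer distinct pages each time step touches, the fewer (slow) page loads you pay for, which is why avoiding collisions stays important.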
Hope this helps.

Related

Create Dynamically Allocated Array C++

I'm trying to create a dynamically allocated array of type unsigned char* in C++. However, when I do this I get back a string instead of a bracket-enclosed ({}) array, which is what I want.
unsigned char* arr = new unsigned char[arrLen];
[Code picture: debugger view showing the difference between the two]
You see how the latter doesn't just go to nothing after the first character? That's what I want.
How might I go about remedying this?
Thank you for your time.
First, the debugger assumes by default that a char represents an ASCII character rather than a number, and it will display char as such.
arr2 has type const char[3], so the debugger knows there are 3 elements to display.
arr has type const char*. The debugger can't know whether it points to a single element or to an array of some length.
If you are using Visual Studio, for instance, you can hint the debugger to display three chars by adding a "variable watch" with the syntax arr,3 in the watch window.
I'm not sure if this is what you are looking for, but have you tried using a std::vector? It can handle the dynamic assignment you are looking for at least, and shouldn't treat a NULL character as the end of a string.
#include <vector>
std::vector<char> arr = { 0x5A, 0x00, 0x2B };
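If the length is only known at runtime (as with arrLen in the question), a sized vector also works; a small sketch, with buf used here as a stand-in name:

std::vector<unsigned char> buf(arrLen, 0);  // arrLen zero-initialized bytes
buf[0] = 0x5A;                              // element access works like an array
// buf.size() is always known, so a debugger can display every element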
If you want a list of chars (an array) that grows dynamically, what you need is a list of pointers where each segment is a large fixed size, say 1000. The vector container class sacrifices memory usage for the ability to grow: it allows dynamic growth, but uses a lot of memory.
Also, growing one data element at a time is not recommended for a large list of data. If you want dynamic growth for a large list, create the list in chunks, as in the following example. Use a large list segment of, say, 1000 units. In the example I create 1000 segments by allocating an array of 1000 pointers. This gives you the 1 million chars you are looking for and can still grow dynamically. The following example shows how you would do this.
int main() {
    unsigned char* listsegment[1000];
    int chrn = 0;
    int y = 0;
    for (int x = 0; x < 1000; x++) {
        listsegment[x] = new unsigned char[1000];
        for (y = 0; y < 1000; y++) {
            *(listsegment[x] + y) = chrn;
            if (chrn >= 255) chrn = 0;
            else chrn++;
        }
    }
}
Completing the program - what if more than 1000 segments need to be dynamically allocated?
Then create a list of segment sets. It can be kept either in a linked list or in a container class.
Since a single set holds 1000 segments of 1000 characters, a collection of these sets probably does not need to be larger than 1000. A thousand sets would hold 1000*(1000*1000) characters, which is one billion. Therefore, the collection would only need to hold 1000 sets or fewer, which can be iterated through quickly, so random access to the collection is not necessary.
Here is the program redone to support an arbitrary number of sets through an arbitrarily large collection of sets. It is also a good example of segmented dynamic memory allocation in general.
#include <iostream>
#include <queue>
using namespace std;

struct listSegmentSetType {
    unsigned char* listSegment[1000];
    int count = 0;
};

int main() {
    listSegmentSetType listSegmentSet;
    queue<listSegmentSetType> listSegmentSetCollection;
    int numberOfListSegmentSets = 0;
    int chrn = 0;
    int y = 0;

    listSegmentSet.count = 0;
    for (int x = 0; x < 1000; x++) {
        listSegmentSet.listSegment[x] = new unsigned char[1000];
        for (y = 0; y < 1000; y++) {
            *(listSegmentSet.listSegment[x] + y) = chrn;
            if (chrn >= 255) chrn = 0;
            else chrn++;
        }
        listSegmentSet.count++;
    }
    // add the completely filled first list segment set to the queue
    listSegmentSetCollection.push(listSegmentSet);
    numberOfListSegmentSets++;

    // now fill in the second set of list segments
    listSegmentSet.count = 0;
    for (int x = 0; x < 1000; x++) {
        listSegmentSet.listSegment[x] = new unsigned char[1000];
        for (y = 0; y < 1000; y++) {
            *(listSegmentSet.listSegment[x] + y) = chrn;
            if (chrn >= 255) chrn = 0;
            else chrn++;
        }
        listSegmentSet.count++;
    }
    listSegmentSetCollection.push(listSegmentSet);
    numberOfListSegmentSets++;

    // fill out any further sets of list segments the same way, pushing each
    // onto the collection once listSegmentSet.count reaches 1000
}

modifying values in pointers is very slow?

I'm working with a huge amount of data stored in an array, and am trying to optimize the amount of time it takes to access and modify it. I'm using Windows, C++ and VS2015 (Release mode).
I ran some tests and don't really understand the results I'm getting, so I would love some help optimizing my code.
First, let's say I have the following class:
class foo
{
public:
    int x;

    foo()
    {
        x = 0;
    }
    void inc()
    {
        x++;
    }
    int X()
    {
        return x;
    }
    void addX(int &_x)
    {
        _x++;
    }
};
I start by initializing 10 million pointers to instances of that class into a std::vector of the same size.
#include <vector>
int count = 10000000;
std::vector<foo*> fooArr;
fooArr.resize(count);
for (int i = 0; i < count; i++)
{
fooArr[i] = new foo();
}
When I run the following code, and profile the amount of time it takes to complete, it takes approximately 350ms (which, for my purposes, is far too slow):
for (int i = 0; i < count; i++)
{
fooArr[i]->inc(); //increment all elements
}
To test how long it takes to increment an integer that many times, I tried:
int x = 0;
for (int i = 0; i < count; i++)
{
x++;
}
Which returns in <1ms.
I thought maybe the number of integers being changed was the problem, but the following code still takes 250ms, so I don't think it's that:
for (int i = 0; i < count; i++)
{
fooArr[0]->inc(); //only increment first element
}
I thought maybe the array index access itself was the bottleneck, but the following code takes <1ms to complete:
int x;
for (int i = 0; i < count; i++)
{
x = fooArr[i]->X(); //set x
}
I thought maybe the compiler was doing some hidden optimizations on the loop itself for the last example (since the value of x will be the same during each iteration of the loop, so maybe the compiler skips unnecessary iterations?). So I tried the following, and it takes 350ms to complete:
int x;
for (int i = 0; i < count; i++)
{
fooArr[i]->addX(x); //increment x inside foo function
}
So that one was slow again, but maybe only because I'm incrementing an integer through a pointer again.
I tried the following too, and it returns in 350ms as well:
for (int i = 0; i < count; i++)
{
fooArr[i]->x++;
}
So am I stuck here? Is ~350ms the absolute fastest that I can increment an integer, inside of 10million pointers in a vector? Or am I missing some obvious thing? I experimented with multithreading (giving each thread a different chunk of the array to increment) and that actually took longer once I started using enough threads. Maybe that was due to some other obvious thing I'm missing, so for now I'd like to stay away from multithreading to keep things simple.
I'm open to trying containers other than a vector too, if it speeds things up, but whatever container I end up using, I need to be able to easily resize it, remove elements, etc.
I'm fairly new to c++ so any help would be appreciated!
Let's look at it from the CPU's point of view.
Incrementing a plain integer means the CPU has it in a register and just increments it. This is the fastest option.
With fooArr[i]->inc() the CPU is given an address (vector element -> member), must load that value into a register, increment it, and write the result back to that address. Worse: the CPU cache is filled with the pointers stored in the vector, not with the foo objects they point to. Too few hits, too much cache "refueling".
If all those members were stored directly in a vector, CPU cache hits would be much more frequent.
Try the following:
int count = 10000000;
std::vector<foo> fooArr;
fooArr.resize(count, foo());
for (auto it = fooArr.begin(); it != fooArr.end(); ++it) {
    it->inc();
}
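Equivalently, if C++11 is available, a range-based for loop does the same thing a little more tersely:

for (auto& f : fooArr)   // fooArr is the std::vector<foo> from above
    f.inc();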
The new is killing you, and you actually don't need it, because resize inserts default-constructed elements at the end when the new size is greater (check the docs: std::vector::resize).
The other thing is the use of pointers, which IMHO should be avoided until the last moment, and which is unnecessary in this case. Performance should be a little better since you get better locality of reference (see cache locality). If the objects were polymorphic or something more complicated, it might be different.

Vector performance suffering

I've been working on state space exploration and was originally using a map to store the assignment of the world states, like map<Variable *, int>, where variables are objects in the world with a domain from 0 to n, where n is finite. The implementation was extremely quick, but I noticed that it does not scale well with the size of the state space. I changed the states to use vector<int> instead, where I use the id of a variable to find its index in the vector. Memory usage improved greatly, but the efficiency of the solver has tanked (gone from <30 seconds to 400+). The only code I modified was the code generating the states and checking whether a state is the goal. I can't figure out why using a vector has degraded performance so much, especially since the vector operations should only take linear time at worst.
Originally this was how I generated nodes:
State* SuccessorGen::generate_successor(const Operator &op, map<Variable *, int> &var_assignment){
    map<Variable *, int> values;
    values.insert(var_assignment.begin(), var_assignment.end());
    vector<Operator::Effect> effect = op.get_effect();
    vector<Operator::Effect>::const_iterator eff_it = effect.begin();
    for (; eff_it != effect.end(); eff_it++){
        values[eff_it->var] = eff_it->after;
    }
    return new State(values);
}
And in my new implementation:
State* SuccessorGen::generate_successor(const Operator &op, const vector<int> &assignment){
    vector<int> child;
    child = assignment;
    vector<Operator::Effect> effect = op.get_effect();
    vector<Operator::Effect>::const_iterator eff_it = effect.begin();
    for (; eff_it != effect.end(); eff_it++){
        Variable *v = eff_it->var;
        int id = v->get_id();
        child[id] = eff_it->after;
    }
    return new State(child);
}
(The goal checking is similar, just looping over the goal assignment instead of operator effects.)
Are these vector operations really that much slower than using a map? Is there an equally efficient STL container I can use that has a lower overhead? The number of variables is relatively small (<50) and the vector never needs to be resized or modified after the for loop.
Edit:
I tried timing one loop through all the operators to compare. With the effect list and assignment, the vector version runs one loop in 0.3 seconds, while the map version takes a little over 0.4 seconds. When I comment that section out, the map version stays about the same, yet the vector version jumps up to closer to 0.5 seconds. I added child.reserve(assignment.size()) but that did not make any difference.
Edit 2:
Following user63710's answer, I've also been digging through the rest of the code and noticed something really strange going on in the heuristic calculation. The vector version works fine, but in the map version I use the line Node *n = new Node(i, transition.value, label_cost); open_list.push(n);, and once the loop finishes filling the queue the nodes are totally screwed up. Node is a simple struct:
struct Node{
    // Source Value, Destination Value
    int from;
    int to;
    int distance;

    Node(int &f, int &t, int &d) : from(f), to(t), distance(d){}
};
Instead of holding from, to, and distance, the nodes end up with from and to replaced by some random number, the search does not do what it should, and it returns much faster than it should. When I tweak the map version to convert the map to a vector and run this:
Node n(i, transition.value, label_cost); open_list.push(n);
the performance is about equal to that of the vector version. So that fixes my main issue, but it leaves me wondering why using Node *n gives this behaviour as opposed to Node n?
If, as you say, these structures are fairly small (~50 elements), I have to think the issue is somewhere else. At least, I don't think it involves the memory access or allocation behaviour of the vector/map.
Some example code I made to test: Map version:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <map>
#include <memory>
#include <vector>
using namespace std;

unique_ptr<map<int, int>> make_successor_map(const vector<int> &ids,
                                             const map<int, int> &input)
{
    auto new_map = make_unique<map<int, int>>(input.begin(), input.end());
    for (size_t i = 0; i < ids.size(); ++i)
        swap((*new_map)[ids[i]], (*new_map)[i]);
    return new_map;
}

int main()
{
    auto a_map = make_unique<map<int, int>>();
    // ids to access
    vector<int> ids;
    const int n = 100;
    for (int i = 0; i < n; ++i)
    {
        a_map->insert({i, rand()});
        ids.push_back(i);
    }
    random_shuffle(ids.begin(), ids.end());
    for (int i = 0; i < 1e6; ++i)
    {
        auto temp_map = make_successor_map(ids, *a_map);
        swap(temp_map, a_map);
    }
    cout << a_map->begin()->second << endl;
}
Vector version:
// (same headers and using-directive as the map version above)
unique_ptr<vector<int>> make_successor_vec(const vector<int> &ids,
                                           const vector<int> &input)
{
    auto new_vec = make_unique<vector<int>>(input);
    for (size_t i = 0; i < ids.size(); ++i)
        swap((*new_vec)[ids[i]], (*new_vec)[i]);
    return new_vec;
}

int main()
{
    auto a_vec = make_unique<vector<int>>();
    // ids to access
    vector<int> ids;
    const int n = 100;
    for (int i = 0; i < n; ++i)
    {
        a_vec->push_back(rand());
        ids.push_back(i);
    }
    random_shuffle(ids.begin(), ids.end());
    for (int i = 0; i < 1e6; ++i)
    {
        auto temp_vec = make_successor_vec(ids, *a_vec);
        swap(temp_vec, a_vec);
    }
    cout << *a_vec->begin() << endl;
}
The map version takes around 15 seconds to run on my old Core 2 Duo T9600, and the vector version takes 0.406 seconds. Both were compiled on G++ 4.9.2 with g++ -O3 --std=c++1y. So if your code takes 0.4s per iteration (note that it took my example code 0.4s for 1 million calls), then I really think your problem is somewhere else.
That's not to say you can't see a performance decrease from switching from map to vector, but the code you posted doesn't show much reason for that to happen.
The problem is that you create vectors without reserving space. Vectors store elements contiguously. That ensures constant-time access to elements.
So whenever adding an item exceeds the current capacity (for example via your inserter), the vector has to allocate more space and move all the existing elements to the new memory location. This causes slowdowns and considerable heap fragmentation.
The solution is to reserve() space for the elements if you know in advance how many you'll have. If you don't, reserve() larger chunks and compare size() and capacity() to check whether it's time to reserve more.
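As an illustration of that pattern (a small sketch with a hypothetical element count n, not taken from the question's code):

std::vector<int> values;
values.reserve(n);                  // one allocation up front
for (int i = 0; i < n; ++i)
    values.push_back(i);            // no reallocation happens inside the loop

// growing in chunks when the final size is unknown in advance:
if (values.size() == values.capacity())
    values.reserve(values.capacity() + 1024);   // reserve the next chunk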

How to speed up a function that returns a pointer to object in c++?

I am a mechanical engineer, so please understand I am not trained in proper coding. I have a finite element code that uses grids to make elements, which make up a model. The Element class is not important to this question, so I have left it out. The elements and grids are read in from a file, and that part works.
class Grid
{
private:
    int id;
    double x;
    double y;
    double z;
public:
    Grid();
    Grid(int, double, double, double);
    int get_id() { return id; };
};

Grid::Grid() {};

Grid::Grid(int t_id, double t_x, double t_y, double t_z)
{
    id = t_id; x = t_x; y = t_y; z = t_z;
}

class SurfaceModel
{
private:
    Grid** grids;
    Element** elements;
    int grid_count;
    int elem_count;
public:
    SurfaceModel();
    SurfaceModel(int, int);
    ~SurfaceModel();
    void read_grid(std::string);
    int get_grid_count() { return grid_count; };
    Grid* get_grid(int);
};

SurfaceModel::SurfaceModel()
{
    grids = NULL;
    elements = NULL;
}

SurfaceModel::SurfaceModel(int g, int e)
{
    grids = new Grid*[g];
    for (int i = 0; i < g; i++)
        grids[i] = NULL;
    elements = new Element*[e];
    for (int i = 0; i < e; i++)
        elements[i] = NULL;
}

void SurfaceModel::read_grid(std::string line)
{
    ... blah blah ...
    grids[index] = new Grid(n_id, n_x, n_y, n_z);
    ... blah blah ....
}

Grid* SurfaceModel::get_grid(int i)
{
    if (i < grid_count)
        return grids[i];
    else
        return NULL;
}
When I need to actually use a grid, I call get_grid, maybe something like this:
SurfaceModel model(...);
.... blah blah .....
for (int i = 0; i < model.get_grid_count(); i++)
{
Grid *cur_grid = model.get_grid(i);
int cur_id = cur_grid->get_id();
}
My problem is that the call to get_grid seems to be taking more time than I think it should to simply return my object. I have run gprof on the code and found that get_grid gets called about 4 billion times in a very large simulation, and another operation that uses x, y, z (some multiplication) is called about as often. What I found is that get_grid and the math each take about the same amount of time (~40 seconds). It seems like I have done something wrong. Is there a faster way to get that object out of there?
I think you're forgetting to set grid_count and elem_count.
This means they will have uninitialized (indeterminate) values. If you loop up to those values, you can easily end up with a huge number of iterations.
SurfaceModel::SurfaceModel()
    : grids(NULL),
      elements(NULL),
      grid_count(0),
      elem_count(0)
{
}

SurfaceModel::SurfaceModel(int g, int e)
    : grid_count(g),
      elem_count(e)
{
    grids = new Grid*[g];
    for (int i = 0; i < g; i++)
        grids[i] = NULL;
    elements = new Element*[e];
    for (int i = 0; i < e; i++)
        elements[i] = NULL;
}
However, I suggest you get rid of each instance of new in this program (and use a vector for the grids).
On a modern CPU accessing memory often takes longer than doing multiplication. Getting good performance on modern systems can often mean focusing more on optimizing memory accesses than optimizing computation. Because you are storing your grid objects as an array of dynamically allocated pointers the grid objects themselves will be stored non-contiguously in memory and you will likely get many cache misses when trying to access them. In this example you would probably see a significant speedup by storing your grid objects directly in an array or vector since you will be accessing contiguous memory in your loop and so get good cache utilization and effective hardware prefetching.
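A minimal sketch of that change (my own illustration; only the Grid class from the question is assumed, everything else here is simplified):

#include <vector>

class SurfaceModel
{
private:
    std::vector<Grid> grids;   // Grid objects stored contiguously, no per-grid new
public:
    explicit SurfaceModel(int g) { grids.reserve(g); }   // elements handled similarly
    void add_grid(int id, double x, double y, double z)
    {
        grids.push_back(Grid(id, x, y, z));
    }
    int get_grid_count() const { return static_cast<int>(grids.size()); }
    Grid& get_grid(int i) { return grids[i]; }   // return a reference, no pointer chasing
};

The loop in the question then becomes Grid &cur_grid = model.get_grid(i);, and successive iterations walk through memory back to back, which is what the cache and the hardware prefetcher like.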
4 billion calls at a microsecond each (which is a pretty acceptable time in many cases) would give 4000 seconds. Since you only get about 40 s (if I read it right), I doubt there's anything seriously wrong here. If it's still too slow for the task, I'd consider parallel computing.

Trying to fill a 2d array of structures in C++

As above, I'm trying to create and then fill an array of structures with some starting data to then write to/read from.
I'm still writing the cache simulator as per my previous question:
Any way to get rid of the null character at the end of an istream get?
Here's how I'm making the array:
struct cacheline
{
    string data;
    string tag;
    bool valid;
    bool dirty;
};

cacheline **AllocateDynamicArray(int nRows, int nCols)
{
    cacheline **dynamicArray;
    dynamicArray = new cacheline*[nRows];
    for (int i = 0; i < nRows; i++)
        dynamicArray[i] = new cacheline[nCols];
    return dynamicArray;
}
I'm calling this from main:
cacheline **cache = AllocateDynamicArray(nooflines,noofways);
It seems to create the array OK, but when I try to fill it I get memory errors. Here's how I'm trying to do it:
int fillcache(cacheline **cache, int cachesize, int cachelinelength, int ways)
{
    for (int j = 0; j < ways; j++)
    {
        for (int i = 0; i < cachesize/(cachelinelength*4); i++)
        {
            cache[i][ways].data = "EMPTY";
            cache[i][ways].tag = "";
            cache[i][ways].valid = 0;
            cache[i][ways].dirty = 0;
        }
    }
    return(1);
}
Calling it with:
fillcache(cache, cachesize, cachelinelength, noofways);
Now, this is the first time I've really tried to use dynamic arrays, so it's entirely possible I'm doing it completely wrong, let alone when trying to make it 2D. Any ideas would be greatly appreciated :)
Also, is there an easier way to write to/read from the array? At the moment (I think) I'm having to pass lots of variables to and from functions, including the array (or a pointer to the array?) each time, which doesn't seem efficient.
Something else I'm unsure of: when I pass the array (pointer?) and edit it inside the function, will the array still be edited after I return from the function?
Thanks
Edit:
Just noticed a monumentally stupid error; it should of course be:
cache[i][j].data = "EMPTY";
You should find your happiness. You just need the time to check it out (:
The way to happiness